A router for my local models

A machine at home

I run an Apple Silicon Mac Studio at home, which means a large unified memory pool that can hold substantial model weights in fast memory. It runs llama.cpp as an inference server, and over time I have pointed a lot of things at it: coding harnesses, scripts, personal projects, experiments. It works well. Local inference is fast enough for most tasks, private by default, and does not accumulate API costs.

But the setup was fragmented. Every tool I used needed its own configuration: the endpoint URL, the model names, sometimes quirks in how the API was called. When I added a new model or changed a name, I updated each tool separately. When I wanted to share access with someone else, I gave them the raw endpoint and hoped the firewall was configured correctly. There was no central point of control, no visibility into what was running, and no real answer to the question of what happened when requests overlapped.

The idle machine

The problem that pushed me toward building something was simple: I was away from home, I needed to use my coding harness, and the model was sitting idle while I cobbled together a connection that mostly did not work.

The Mac Studio does not have a static public IP. Getting to it remotely meant either setting up a VPN, port-forwarding through a router, or accepting that I would use a hosted API fallback and come back to the local setup when I got home. None of these felt like a real solution. I wanted to use the capacity I already had, from wherever I was, without manual workarounds each time.

There were other motivations underneath this. I wanted to understand the model serving process better, not just call an API but actually understand how requests queue, how streaming works at the protocol level, how you handle a slow inference run without dropping the client connection. And I wanted a setup that could eventually scale: where friends or family could share capacity, where I could pool resources rather than have each machine sit isolated.

llmesh is what I built to address all of this.

What llmesh does

At its simplest, llmesh is a router. It accepts requests in OpenAI or Anthropic API format and dispatches them to connected inference workers. From the perspective of any tool using it, it looks like a standard LLM API endpoint. The worker handling the actual inference is invisible.

The architecture has three parts.

The router runs on my VPS. It is the central coordination point: it validates API keys, queues incoming requests, dispatches jobs to available workers, and streams responses back to clients. It runs as a small container at around 64 MB of RAM and does no inference itself.

The clients run on inference machines, including my Mac Studio. Each client is a small agent that connects outbound to the router over WebSocket and registers the models it can serve. When the router has a job matching a model the client supports, it dispatches the job over the WebSocket connection. The client forwards it to the local llama.cpp server and streams results back through the same connection.

The tools connect to the router’s endpoint. My coding harness, scripts, personal projects: they all point at the same address and do not know or care which machine is actually running inference.

The outbound connection

The decision to use outbound WebSocket connections from clients to the router was the architectural choice that made everything else easier.

The alternative, having the router connect inbound to each worker, requires each inference machine to be publicly reachable. That means static IPs, port-forwarding, or VPN setup on every machine. It makes sharing capacity with others significantly more complicated.

With outbound connections, the inference machine only needs to be able to reach the router’s public address. No inbound firewall rules. No port-forwarding. My Mac Studio connects out to the VPS and holds the connection open. When I make a request from a different machine entirely, the router uses the already-open connection to dispatch the job.

This also means adding a new worker is trivial. Configure the client with the router’s address and a token, run it, and it registers. No router-side configuration changes are needed.

The scheduler

Once multiple workers are connected, the router needs to decide which one handles each request. The scheduler is one of the more interesting parts of llmesh.

Requests are queued in a three-tier priority structure: high, normal, and low, with FIFO within each tier. Most requests come in at normal priority. The system exposes higher and lower priorities for cases where I want to ensure a particular task gets resources quickly, or explicitly want to run something in the background without competing with interactive work.

The more interesting part is owner affinity. Each worker has a concept of an owner, the person whose machine it is. When the owner makes a request, the scheduler preferentially routes it to their own hardware. This matters in a shared setup: if I have shared capacity with someone, their requests can use my machine when it is idle, but my own requests do not get delayed by theirs. The owner gets guaranteed priority on their own machine.

Model aliasing sits on top of this. I might have the same model running on two different machines with slightly different quantisations or context lengths. In the router’s config, I define a logical alias that maps to both. The scheduler picks whichever underlying model and worker is most appropriate based on load and affinity. The client making the request just uses the alias name and does not need to know about the underlying hardware.

Building it: keeping connections alive

The hardest problem I ran into was not the routing logic or the scheduler. It was keeping connections alive for the duration of a long inference run.

Local models are slow compared to hosted APIs. A complex prompt might take several minutes to produce a full response. During that time, the connection chain looks like: tool to router over HTTP/SSE, router to client over WebSocket, client to llama.cpp over HTTP. Every link in that chain has its own timeout, and proxies along the way will drop connections that appear idle.

The inference side was streaming tokens as they were generated, so the outward-facing SSE connection stayed alive as long as tokens were flowing. But for slow models generating tokens at, say, one per second, the gaps between tokens were long enough to trigger proxy timeouts in practice.

The solution was a keepalive layer that operates independently of the token stream. The router sends SSE comment lines on a timer while a response is in progress, which keeps the HTTP connection to the client tool alive without the tool needing to process them. On the WebSocket side between router and client, the standard WebSocket ping/pong mechanism keeps that link alive. The client does the same toward llama.cpp.

Getting the timing right required experimenting. Cloudflare and other proxies have different idle timeouts. The keepalive interval needed to be short enough to survive aggressive proxies but not so frequent that it added meaningful overhead. The current default is 15 seconds for the SSE keep-alive, which has been reliable across the setups I have tested.

Lease management and failure modes

In a distributed system, machines disconnect. A client might drop its WebSocket connection mid-inference due to a power cycle, network blip, or process restart. Without handling this, the router would wait forever for a response from a worker that is gone.

llmesh handles this with a lease model. When the router dispatches a job to a client, it sets a lease timer. If the client does not complete the job before the lease expires, the router reclaims the job and re-queues it. The requesting tool either waits for the retry or gets an error depending on how the timeouts stack.

The tricky part is deduplication. When a job is retried, the router cannot send the same job ID to a different client without risking double-processing. llmesh tracks in-flight job IDs and rejects duplicates at the dispatch layer.

There are several timeout layers at play. Time to first token covers the total wait from request to first response byte. An activity timeout aborts streams that have gone silent for too long, catching cases where the worker is alive but inference has stalled. A batch timeout handles non-streaming requests separately. Each of these needed to be tuned against the slowest realistic scenario: a large model on modest hardware, under load, behind a proxy with opinions about connection idle time.

This is one of those areas where getting it right required thinking through failure modes I had not actually seen in practice yet. A machine disconnecting mid-inference is rare. But when it happens, the behaviour should be predictable rather than a stuck request that never resolves.

The admin portal

There is a web interface at for managing the practical side of running a shared router: creating and revoking API keys, managing client tokens, configuring model aliases, and seeing what is connected and what is queued.

I do not use it constantly, but it is where I go when I want to onboard someone new or check that everything is behaving as expected. The alternative, editing config files directly and restarting the service, works but is friction I did not want in something designed to run without regular maintenance attention.

Now

All of my LLM requests route through llmesh now. My coding harness, personal scripts, and projects all point at the same endpoint on my VPS. The Mac Studio handles inference when it is available and connected; requests I make from anywhere go through the same queue.

The utilisation picture is better than it was. Before, the Mac Studio was either fully in use or sitting idle with no visibility into which. Now I can see queue depth, which requests are waiting, and how capacity is being used. I have started prioritising tasks explicitly rather than hoping the model responds before a timeout.

I am starting to build integrations into personal projects, which is the part I am most interested in. Having a single endpoint with known behaviour makes it straightforward to add model calls to an application without wiring up separate API credentials and endpoint configs each time.

Building for yourself

There is a particular satisfaction in building infrastructure that runs quietly and just works. llmesh is not solving a problem without existing solutions. There are hosted routers, cloud APIs, and other open-source options. But building it meant understanding the problem well enough to make real decisions at every layer: what the scheduler should optimise for, what a failure should look like to the client, where state should live.

The self-hosted angle matters too. Running local models is partly about cost and privacy, but it is also about legibility. I know what is handling my requests, where the weights live, what the queue looks like, and why a particular request took longer than expected. llmesh is one more layer in a stack I have chosen to own rather than rent.

Building something you use every day is also a good way to find out what is actually important in a design. The keepalive work was not in any design document. It became important the first time I tried to use a slow model over a proxy that had opinions about idle connections. That kind of feedback loop is hard to get any other way.