Thor, Six Weeks On: Building a Production-Grade AI Stack at Home

In the first Thor post I stood up a private AI server on an NVIDIA Jetson AGX Thor — 128GB of unified memory, a Blackwell GPU — and ran three inference backends side by side: Ollama, vLLM, and a hand-compiled TensorRT-LLM engine, all behind one OpenAI-compatible API, with a dashboard showing live tok/s. I was pretty pleased with it.

Six weeks of actually using it reframed the entire project for me. The inference was the fun part — the exploration. But the thing I’m actually building isn’t a chatbot on a Jetson. It’s a production-grade AI platform that happens to run in my house: the same disciplines I’d demand of any system serving real traffic at work — observability and scoring, caching, token-cost optimization, evaluators, guardrails, and end-to-end auditability — applied to a stack where nothing touches the cloud and I own every layer.

This post is about that reframing. It’s also an honest accounting of what I deleted to get there, starting with the headline: two of the three inference backends I bragged about are gone, and almost everything that matters now lives in the platform layers I hadn’t built yet.


The thesis: production discipline, self-hosted

At work, “is this service production-ready” isn’t a vibe — it’s a checklist. Can you see what it’s doing? Can you reproduce a failure? Does it degrade gracefully? Can you prove what it did after the fact? I’ve spent years scoring services against exactly that kind of operational-maturity rubric. The interesting question this build asks is: what does that rubric look like for an LLM platform, and can I hit it at home with no cloud dependency?

Six pillars, the way I’m designing them:

| Pillar | What it means for an AI stack | Where it lives on Thor |
| --- | --- | --- |
| Observability + scoring | See every prompt/completion, token count, latency, and call tree, and grade the outputs | Langfuse, fed by the gateway |
| Caching | Don’t pay (in tokens or watts) to recompute an identical call | LiteLLM + Redis |
| Token optimization | Spend the smallest context and the cheapest model that does the job | Headroom proxy, role routing, KV-cache quant |
| Evaluators | Regression-test model and prompt changes against fixed tasks, not vibes | Hermes eval runners |
| Guardrails | Access control, graceful degradation, and proof a tool actually ran | Virtual keys, fallback policy, verification probes |
| Auditability | Reconstruct exactly what happened, after the fact | Langfuse traces + SOPS-managed secrets |

Each section below is one of those rows. None of it needed new hardware — same Jetson, same Blackwell GPU from April. What changed is that I got serious about the stuff around the model.


The reversal: three backends down to one

The original post’s centerpiece was running Ollama, vLLM, and TensorRT-LLM simultaneously, each as a separate model in the dropdown. It was a genuinely fun engineering exercise — patching the TRT build for Blackwell’s sm_110, wrapping the llm_inference binary in an aiohttp shim, tuning vLLM’s KV-cache reservation down to 8% so all three could coexist.

Then I tried to operate it, and the three-backend design failed the production test on its first question: what is this complexity buying me? All three were serving roughly the same 7B at roughly the same quality. The backend diversity solved a problem I didn’t have (benchmarking inference engines) at the cost of the one I did (orchestrating a useful set of models under a maturity rubric). Three things to monitor, three failure modes, three sets of metrics — for no production value.

So I pulled vLLM and TRT. The git history records the moment as vLLM revert (close-out). Ollama became the single model server, and everything I built afterward went in front of it, where the production pillars actually live.

The unified-memory wall I said I’d never hit

In the first post I claimed the 128GB pool meant you “never hit the out-of-VRAM wall.” True for one 7B. Emphatically false when you keep several 30B-class models warm. With OLLAMA_MAX_LOADED_MODELS=5, the first time a coder (30B), a reasoner (35B MoE), and a couple of smaller models were all resident and a request hit a cold sixth, the GPU OOM’d mid-load and took the warm set with it. A production stack fails predictably, so I capped it:

OLLAMA_MAX_LOADED_MODELS=2   # was 5

Two warm 30B-class models is the real ceiling at useful context lengths. Unified memory removes the VRAM-vs-RAM wall. It does not remove the wall — and pretending otherwise is how you get a 2am OOM instead of a clean queue.


The platform, as a stack of pillars

Same host, same Ollama at the bottom, but the center of gravity moved up:

[Figure: Thor’s request path. Hermes on top, flowing down through the Headroom compression proxy and the LiteLLM gateway, which fans out to Langfuse (observability), Redis (cache), and Ollama (the model server), with the warm model set at the bottom.]

Observability + scoring — Langfuse

The first post’s dashboard answered a hardware question: is the GPU generating, how fast? After a week of building agents the question that mattered was behavioral: what prompt actually went out, what came back, how many tokens did it burn, and why did it loop?

Langfuse, self-hosted on Thor, answers that. The LiteLLM gateway pushes a trace for every call, so each request — from Open WebUI, an n8n flow, or an agent — lands as a structured trace: resolved model, full prompt, completion, token counts, latency, and the call tree for multi-step runs. That’s the observability pillar; it’s also the foundation for scoring, because once every generation is a trace you can attach a grade to it (an eval verdict, a human thumbs-down, a cost) and start ranking prompts and models instead of guessing. It is the single most valuable tool I added, and it’s wired in at the gateway precisely so nothing can call a model without being seen.
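
To make the "attach a grade, then rank" idea concrete, here is a minimal stdlib-only sketch. It does not use the Langfuse SDK; the `Trace` record, `attach_score`, and `rank_models` are hypothetical names that only mirror the fields listed above (model, token counts, latency) plus a free-form score dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Trace:
    """Hypothetical stand-in for one gateway trace; not a Langfuse type."""
    trace_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    scores: dict[str, float] = field(default_factory=dict)


def attach_score(trace: Trace, name: str, value: float) -> None:
    # A score can be an eval verdict, a human thumbs-down, a cost, and so on.
    trace.scores[name] = value


def rank_models(traces: list[Trace], score_name: str) -> list[tuple[str, float]]:
    """Average one named score per model and sort best-first."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for t in traces:
        if score_name in t.scores:  # unscored traces are ignored, not counted as zero
            by_model[t.model].append(t.scores[score_name])
    ranked = ((model, mean(values)) for model, values in by_model.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

The design choice worth noting is that scores live on the trace rather than in a separate table, so any ranking is a fold over data the gateway already recorded.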

Caching — LiteLLM + Redis

The gateway runs as three containers, and the third one is the point of this section:

litellm:        { image: ghcr.io/berriai/litellm:v1.86.2 }
litellm-pg:     { image: postgres:17-trixie }   # config + logs
litellm-redis:  { image: redis:7-alpine }       # response cache

Local inference isn’t free just because there’s no invoice — it costs watts and warm-model slots, and on a single GPU a redundant call is a request another agent is now queued behind. Redis-backed response caching at the gateway means an identical call doesn’t re-run the model at all. It’s the most boring pillar and one of the highest-leverage on constrained hardware: the cheapest token is the one you never generate.

Token optimization — Headroom, routing, and KV-cache quant

Three layers of “spend less”:

  1. Context compression. Agent workflows are context gluttons — tool outputs, file dumps, scrollback, every turn. Headroom (the headroom-ai package, in an isolated venv) is a loopback proxy (127.0.0.1:8787) that compresses context automatically before it reaches the model, with an explicit headroom-compress --role tool for squashing a big artifact on demand.
  2. Route to the cheapest sufficient model. The gateway’s scut alias points at a 9B for cheap, high-fan-out calls; you don’t burn a 30B to classify a string.
  3. Quantize the cache. Carried over from the first build — OLLAMA_KV_CACHE_TYPE=q8_0 and flash attention keep the KV cache small so context length doesn’t blow the memory budget.

The routing matters because Headroom doesn’t bypass observability to get its savings — it forwards Headroom → LiteLLM → Ollama, so the compressed call still gets a Langfuse trace and a role alias. Token optimization and auditability are designed to coexist, not trade off.
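
The "cheapest sufficient model" rule can be sketched as a small lookup. Everything here is illustrative: the role aliases come from this post, but the model tags, capability labels, and the use of parameter count as a cost proxy are assumptions, not the gateway’s real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    alias: str                     # stable name agents call
    model: str                     # hypothetical model tag behind the alias
    params_b: int                  # rough size in billions; used as a cost proxy
    capabilities: frozenset[str]   # hypothetical task labels this role handles


ROLES = [
    Role("scut", "example-9b", 9,
         frozenset({"classify", "extract"})),
    Role("coder", "example-coder-30b", 30,
         frozenset({"classify", "extract", "code"})),
    Role("reasoner", "example-reasoner-35b", 35,
         frozenset({"classify", "extract", "code", "plan"})),
]


def route(task: str, roles: list[Role] = ROLES) -> Role:
    """Pick the smallest role that claims it can do the task."""
    candidates = [r for r in roles if task in r.capabilities]
    if not candidates:
        raise ValueError(f"no role can handle task {task!r}")
    return min(candidates, key=lambda r: r.params_b)
```

Because agents address the alias and not the model tag, swapping the model behind `scut` changes nothing on the caller’s side.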

Evaluators — Hermes eval runners

This is the pillar that separates “I swapped in a new model and it felt better” from an actual engineering practice. Hermes ships eval runners — including a task-specific QuotePilot runner and a general Thor eval runner — that exercise the stack against fixed tasks so a model or prompt change can be measured rather than vibed. Paired with Langfuse, the loop closes: run the eval set, score the traces, compare against the last run, keep or revert. The same discipline as a regression suite, pointed at non-deterministic models. There’s also explicit long-run monitoring hardening, because an agent that’s fine for ten turns and degrades over a hundred is a production problem you only catch by watching the long runs.
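
The run, score, compare, keep-or-revert loop can be sketched in a few lines. This is not the Hermes runner; `run_eval` and `gate` are hypothetical helpers, each task is just a prompt plus a pass/fail check, and the rule "keep unless the score regresses" is an assumption about where to draw the line.

```python
from typing import Callable

# A task is a fixed prompt plus a check that decides whether the output passes.
EvalTask = tuple[str, Callable[[str], bool]]


def run_eval(generate: Callable[[str], str], tasks: list[EvalTask]) -> float:
    """Return the pass rate of `generate` over a fixed task set."""
    if not tasks:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)


def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> str:
    """Keep a change only if it does not regress past the tolerance."""
    return "keep" if candidate_score >= baseline_score - tolerance else "revert"
```

With non-deterministic models a single run is noisy, so in practice each task would be sampled several times and the tolerance set from the observed run-to-run spread.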

Guardrails — access, degradation, and proof-of-action

“Guardrails” for a local agent platform, concretely:

  • Access control. LiteLLM virtual keys mean different agents authenticate with different credentials against the same backend — so I can scope, rotate, or revoke one agent without touching the others.
  • Graceful degradation. A retry/fallback policy at the gateway means a failed call to one role falls back (e.g. to a reviewer) instead of erroring up to the user. Failures should be invisible until they’re catastrophic, then loud.
  • Proof a tool actually ran. The hardest guardrail, and the one I learned the embarrassing way (next section): an agent will happily hallucinate a plausible tool result. The guardrail is a verification probe — a tool whose output the model cannot invent — gating any route before it’s trusted.
  • Isolation as a guardrail. Headroom is loopback-only; exposure is Tailscale-only; nothing has a public surface. The strongest guardrail is the attack surface that doesn’t exist.
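
As an illustration of the graceful-degradation bullet, here is a minimal retry-then-fallback loop. It is a sketch of the general pattern, not LiteLLM’s fallback implementation: `complete_with_fallback`, `AllRoutesFailed`, and the `call(role, prompt)` signature are all hypothetical.

```python
from typing import Callable, Sequence


class AllRoutesFailed(RuntimeError):
    """Raised only after every role in the chain has been exhausted."""


def complete_with_fallback(
    call: Callable[[str, str], str],
    roles: Sequence[str],
    prompt: str,
    retries_per_role: int = 1,
) -> tuple[str, str]:
    """Try each role in order; return (role_that_answered, completion)."""
    errors: list[str] = []
    for role in roles:
        for attempt in range(1 + retries_per_role):
            try:
                return role, call(role, prompt)
            except Exception as exc:  # quiet while a fallback still exists
                errors.append(f"{role} attempt {attempt + 1}: {exc}")
    # Nothing left to fall back to: fail loudly, with the full history.
    raise AllRoutesFailed("; ".join(errors))
```

Returning the role that actually answered matters for the trace: a fallback that succeeds silently should still be visible afterwards.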

Content-level guardrails (structured output validation, policy filters on agent actions) are the next build. The access, degradation, and proof-of-action layers are in place.

Auditability — traces plus SOPS

Auditability is observability with a longer memory and a chain of custody. Every call is a durable Langfuse trace, so “what did the agent do last Tuesday” is a query, not a shrug. And the credentials that make it all run — LiteLLM keys, the Langfuse secret — moved out of ad-hoc files into SOPS-encrypted env files (Move Thor core secrets to SOPS env files), so the audit trail includes who could have done what without a plaintext key ever hitting the repo. A trace you can’t trust the provenance of isn’t an audit.


The trap that taught me the proof-of-action guardrail

The point of all the plumbing is to run agents locally — things that call tools, read files, run commands, chain steps. The current agent is Hermes, a paired Slack/local dev agent on a LangGraph spine, pointed at Thor’s own models. Getting tool calls to work taught me the most important lesson of this phase:

A model that emits convincing tool-call text is not the same as a model that emits structured tool calls, and the difference depends on the route, not just the model.

I first pointed Hermes at the LiteLLM route (coder), for all the right production reasons — reuse, tracing, fallback. It appeared to work, then mysteriously didn’t: instead of the structured tool-call objects the API contract expects, the model emitted pseudo-tool-call text — prose that looked like a function call but couldn’t execute. The same model pointed directly at Ollama produced correct structured calls every time.

The fix wasn’t to abandon the gateway and lose tracing and aliases. It was to find the route that preserved structured tool-calling and the trace — which turned out to be Headroom (custom:thor-headroom / coder), forwarding through LiteLLM for the Langfuse trace and on to Ollama. It passed a deliberately non-guessable terminal tool probe (run a command whose output the model cannot predict, then verify it actually ran). So the live default is Headroom, with direct Ollama as the ranked fallback. Three routes, ranked by preference, all gated by the same probe.

That probe became a permanent guardrail. “The agent replied sensibly” is not evidence the tool ran — the model can invent a plausible result. The only proof is a tool whose output it could not have made up.
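
One possible shape for such a probe, as a sketch: write a random nonce to a temp file, ask the agent to read it with its tool, and check that the nonce comes back. The nonce never appears in the prompt, so a reply containing it is evidence the tool really ran. The names `run_probe`, `honest_agent`, and `hallucinating_agent` are hypothetical, and the two toy agents exist only to show both outcomes.

```python
import secrets
import tempfile
from pathlib import Path
from typing import Callable


def run_probe(agent: Callable[[str], str]) -> bool:
    """Return True only if the agent's reply contains a value it could not guess."""
    nonce = secrets.token_hex(16)  # 128 random bits, never shown to the model
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "probe.txt"
        path.write_text(nonce)
        reply = agent(
            f"Read the file {path} with your terminal tool and report its exact contents."
        )
    return nonce in reply


def honest_agent(prompt: str) -> str:
    # Toy agent that really performs the read (stands in for a working tool route).
    path = prompt.split("Read the file ")[1].split(" with your")[0]
    return f"The file contains: {Path(path).read_text()}"


def hallucinating_agent(prompt: str) -> str:
    # Toy agent that emits a plausible-looking but invented result.
    return "The file contains: deadbeefdeadbeefdeadbeefdeadbeef"
```

A route that fails this check is not trusted for tool use, however sensible its replies look.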

The experiment that came and went

For about two weeks in late May, the paired-agent slot was OpenClaw — a Slack-driven gateway I stood up, tuned, and then removed entirely on 2026-06-15: service, image, Traefik route, repo config, all of it. Good experiment, right call to delete it; Hermes does the job with tighter coupling to Thor’s models and fewer moving parts. Worth recording because the clean version of a homelab tour never shows what you ran for two weeks and tore out — and a production mindset is exactly the thing that says “this isn’t earning its complexity, cut it.”


The smaller upgrades

Voice got faster and less chatty. STT moved to Whisper tiny.en (≈2× faster on CPU, no real English-accuracy loss); TTS moved to server-side XTTS with a deliberately concise persona. Voice is live-demo-critical, so a small voice model (llama3.2:3b-voice) is pinned warm and won’t get evicted under memory pressure.

MCP tools, the easy way. mcpo exposes MCP servers as plain OpenAPI, so tools are normal HTTP any OpenAI-compatible client can call.

Image generation. A ComfyUI/Flux launcher rounds out the stack on the Blackwell GPU.

Edge: Caddy out, Traefik in. The first post terminated HTTPS with Caddy and a per-service cert file. The edge is now Traefik v3.7.1 with the Tailscale cert resolver — one resolver, every route gets its cert automatically, identity read straight off the mounted tailscaled.sock:

certificatesResolvers:
  tailscale:
    tailscale: {}

It’s the same pattern I landed on for the Kubernetes side: let the thing that owns the identity own the cert.


What I got wrong in the first roadmap

The original “What’s Next” listed TRT streaming, larger models, Home Assistant as a local LLM backend, a LoRa off-grid fallback, and a drone sentinel. The honest scorecard:

  • TRT streaming — moot; I deleted TRT.
  • Larger models — done, and it’s where the MAX_LOADED_MODELS wall appeared. 30B/35B-class are the daily drivers.
  • Home Assistant / LoRa / drone — deprioritized on purpose. I spent the six weeks building the production substrate those ideas need — gateway, tracing, caching, token control, evals, guardrails — instead of a flashy demo on a shaky base.

That reprioritization is the whole point. The fun part of a homelab AI build is the inference. The valuable part is everything around it that makes it operable: stable role names so agents survive model swaps, a trace for every call so you debug behavior instead of guessing, caching so you don’t burn watts twice, evals so changes are measured, and a verified tool route so an agent can actually act. None of it is photogenic. All of it is the difference between a server that answers questions and a platform you’d trust with real work.


Does Thor join the Kubernetes cluster?

The obvious question, given I spent the same six weeks building a bare-metal Kubernetes cluster and migrating most of the home lab onto it: does Thor fold in as a node?

Today, no — and I think that’s the right answer, not just the easy one. The cluster runs Talos Linux, which has no Tegra support, so Thor can’t run the same immutable OS as the other nodes. But the deeper reason is that the whole value of this stack is being coupled to this exact box — its GPU, its Tailscale identity, its 128GB. The clean way for the cluster to use Thor isn’t to swallow it as a node; it’s to consume its inference as an external endpoint — an OpenAI-compatible Service the cluster dials like any other backend. Two platforms, kept separate on purpose, meeting at an API.

The tempting alternative — and I do go back and forth on it — is to bring Thor in as a dedicated AI workload node: stock Kubernetes (not Talos) on the Jetson, joined to the cluster but tainted and labeled so only GPU/AI workloads land there via node affinity. That hands placement to the scheduler instead of me wiring an endpoint by hand, which is genuinely nice. But it trades the cluster’s immutable-everything property for one special-snowflake node, and I’m not convinced that’s a trade worth making yet. So for now: endpoint, not node — with the node idea filed under “maybe, once the rest is boring.” I haven’t decided, and I’d rather say that than pretend I have.


What’s next (for real this time)

This space moves fast enough that a lot has already shifted in the six weeks since I started, so this list is as much “keep up” as “build new” — and the more I learn about how this stuff actually behaves, the more I expect half of it to change again.

  • Chase the frontier. Several of the models I keep warm already have stronger replacements out. Keeping the warm set current — swap in the latest open models, re-run the evals, keep what wins — is a standing job now, not a one-time setup.
  • More observability, tighter caching. More traces at finer grain, and a cache that’s smarter about what’s genuinely reusable versus what just looks similar. Both are ongoing, not done.
  • Close the eval loop. A fixed eval set as a release gate: no model or prompt change ships without beating the last run.
  • Content-level guardrails. Structured-output validation and policy checks on agent actions, not just access and degradation.
  • Cost as a first-class metric. I have tokens and latency per trace; the next dashboard scores cost-per-task (tokens × model tier × cache-miss rate, so cached answers count as free) so optimization has a number to move.
  • Keep experimenting. I’m still early in understanding this space, and the fastest way to learn it is to keep building on the platform — new agents, new tools, new routes — and let the evals tell me what’s real.
  • Home Assistant on the gateway. “Turn off the lights” as a local, traced, guardrailed tool call.
  • And eventually, the drone. The first post ended on an autonomous perimeter sentinel — patrol the property, stream video back, report what it sees through the vision model. I still want it. It’s just a longer way out than I’d hoped: building the platform underneath turned out to be the real prerequisite, and that’s most of what these six weeks were. The sentinel is still the north star; this stack is what it’ll fly home to.
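
For the cost-per-task item in the list above, one way to turn the three ingredients into a single number is sketched below. The formula is an assumption, not a settled design: it treats a cache hit as free, so cost scales with the miss rate, and `tier_weight` is a hypothetical relative price per token for the model tier.

```python
def cost_per_task(tokens: int, tier_weight: float, cache_hit_rate: float) -> float:
    """Expected cost of one task in arbitrary units (assumed formula)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Only cache misses reach the model, so only they are charged.
    return tokens * tier_weight * (1.0 - cache_hit_rate)


# Example: 1000 tokens on a tier weighted 2.0, with a quarter of calls cached.
example = cost_per_task(1000, 2.0, 0.25)  # 1500.0
```

Any of the three levers (fewer tokens, a cheaper tier, a higher hit rate) moves the same number, which is what makes it usable as a dashboard metric.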

The hardware hasn’t changed since April. What’s changed is that I can see what the models are doing, I’m not burning the GPU to recompute answers I already have, and I can hand an agent a real task without watching it the whole time. That’s the difference between a cool demo and something I actually run my work through — and it’s where the next six weeks go too.
