Building a Bare-Metal Kubernetes Homelab — Part 0: PXE-Booting Talos

This is Part 0 of a series on building a real bare-metal Kubernetes cluster at home — not a single-node k3s toy, but the kind of setup whose patterns transfer directly to production: immutable nodes, GitOps, declarative everything, and machines you can wipe and rebuild from the network without touching them.

Part 0 is the foundation: PXE-booting Talos Linux off a server I run on my NAS, and bringing up the first control-plane node. It is also a fairly complete catalog of the ways I broke it before it worked. I’m including those on purpose. The clean version of this post would be half the length and a tenth as useful.

The repo is public: github.com/josephsindel/home_stack. The commit history is the real build log.


Why Talos, why PXE

Talos Linux is an immutable, API-driven Kubernetes OS. No SSH, no shell, no package manager — the machine is configured by a single declarative machine config applied over an API. That makes nodes genuinely disposable: the entire state of a node is its config plus its disks, and the config lives in Git.

PXE booting is the other half. If a node can netboot, then “rebuild the cluster” stops meaning “find a USB stick and walk to the rack” and starts meaning “wipe the disk and power-cycle.” For a homelab I’m going to be tearing down and rebuilding constantly while I play, that property is the entire point. (It also, as you’ll see below, turned a self-inflicted disaster into a ten-minute non-event.)

The design follows a strict layering rule:

  • L0 — provisioning. The PXE server. Lives on the NAS. Never a cluster node.
  • L1 — the cluster. Talos + Kubernetes on the MS-01s.
  • L2 — workloads.

Nothing at a lower layer is allowed to depend on anything higher. The PXE host can’t be a node it provisions; that circular dependency is how you end up unable to rebuild the thing that rebuilds things.


The hardware

  • Compute: Minisforum MS-01 (i5-12600H, 32GB, 1TB NVMe, UEFI). Dual i226 2.5G + dual X710 SFP+ 10G.
  • L0 host: Synology DSM 7.2.2 (x86_64). Runs the PXE stack as a Container Manager project.
  • Network: flat (for now), behind a Google/Nest Wifi mesh doing DHCP and DNS.

That last point matters more than it looks. Google’s mesh exposes no DHCP option controls — you cannot set next-server or a bootfile. Normally that’s a homelab annoyance. Here it’s the thing that picks the architecture: proxyDHCP.


The PXE pipeline

proxyDHCP is the trick for networks where you can’t (or shouldn’t) touch the real DHCP server. dnsmasq runs in proxy mode: it assigns no addresses — Google keeps doing that — it only answers the PXE-specific bits, and the client merges the two responses. Zero changes to the router.

The whole netboot brain is about a dozen lines of dnsmasq.conf:

port=0
interface=eth0
bind-interfaces
dhcp-range=192.168.87.0,proxy,255.255.255.0
enable-tftp
tftp-root=/srv/tftp

# loop-break: is this the firmware, or iPXE that already loaded?
dhcp-userclass=set:ipxe,iPXE

# stage 1: bare UEFI firmware -> chainload ipxe.efi over TFTP
pxe-service=tag:#ipxe,X86-64_EFI,"Chainload to iPXE",ipxe.efi
# stage 2: request is from iPXE -> hand it the HTTP boot script
pxe-service=tag:ipxe,X86-64_EFI,"Talos netboot",http://<nas>:8080/ipxe/boot.ipxe

log-facility=-
log-dhcp

port=0 disables dnsmasq’s DNS entirely — Google owns DNS, and a rogue resolver is the last thing you want. The dhcp-userclass line is the part everyone gets wrong: iPXE announces itself with user-class iPXE, so the second time it asks, you hand it the HTTP script instead of ipxe.efi again. Without it, iPXE chainloads itself forever.
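
The loop-break is easier to see as a decision on one input. The function below is a toy model of the two pxe-service rules, not dnsmasq itself: the name pick_bootfile is made up for illustration, and the return values are the ones from the config above, placeholder host included.

```shell
# Toy model of the two pxe-service rules above; this is not dnsmasq code.
# Input: the DHCP user-class the client sent (empty for bare firmware).
pick_bootfile() {
  user_class=$1
  case "$user_class" in
    # Stage 2: the request comes from iPXE, so hand it the HTTP boot script.
    iPXE) echo "http://<nas>:8080/ipxe/boot.ipxe" ;;
    # Stage 1: anything else is firmware, so chainload iPXE over TFTP.
    *)    echo "ipxe.efi" ;;
  esac
}

pick_bootfile ""      # bare UEFI firmware
pick_bootfile iPXE    # second request, sent by iPXE
```

Delete the iPXE branch and both calls return ipxe.efi, which is exactly the endless chainload described above.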

The Talos kernel and initramfs (≈120MB together) are served over HTTP, not TFTP. TFTP moves the ~1MB iPXE binary at a glacial 512-byte-block crawl; you do not want a 98MB initramfs going that way. Keep TFTP for the bootstrap binary only.
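
A rough count shows why. Classic TFTP is lock-step: each data packet has to be acknowledged before the next one is sent, so transfer time scales with the number of packets. The sketch below assumes the default 512-byte block size with no blksize negotiation, and treats the sizes above as MiB; the function name is illustrative.

```shell
# Number of DATA packets TFTP needs for a file of a given size.
# A transfer ends with a block shorter than the block size (possibly empty),
# so the count is floor(size / blksize) + 1.
tftp_data_packets() {
  size_bytes=$1
  blksize=${2:-512}
  echo $(( size_bytes / blksize + 1 ))
}

tftp_data_packets $(( 98 * 1024 * 1024 ))   # initramfs: prints 200705
tftp_data_packets $(( 1024 * 1024 ))        # iPXE binary: prints 2049
```

At an assumed 1 ms per round trip, that is roughly 200 seconds of pure waiting for the initramfs against about 2 seconds for the iPXE binary. The real numbers depend on the network and on whether the client negotiates larger blocks.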

It runs as two containers on the NAS — host-net dnsmasq (it has to hear the raw DHCP broadcast; a NAT’d container never will) and a plain nginx for the HTTP assets, both read-only mounting the asset tree.


The perils

Everything above is the clean result. Here is the actual evening, because the failures are where the learning is and every one of these is a transferable lesson.

1. The kernel cmdline is not yours to write

My first iPXE kernel line was hand-rolled: talos.platform=metal console=tty0 and not much else. Talos booted, then failed systemRequirements (phase 1/9) with: kernel parameter slab_nomerge is required, pti is required.

Talos enforces a KSPP hardening set on the cmdline. You don’t author that line — you fetch it from the Image Factory for your exact schematic and version:

$ curl -fsSL https://factory.talos.dev/image/<id>/v1.13.2/cmdline-metal-amd64
talos.platform=metal console=tty0 init_on_alloc=1 slab_nomerge pti=on \
consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on \
selinux=1 module.sig_enforce=1 proc_mem.force_override=never

Bake that into boot.ipxe, append your own console args, done. Don’t hand-write security-critical kernel parameters from memory.
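
One way to keep that line out of human hands is to render boot.ipxe from the fetched cmdline. This is a minimal sketch, not the script from the repo: render_boot_ipxe is a made-up helper, and the asset names kernel-amd64 and initramfs-amd64.xz are assumptions about how the files are laid out on the HTTP server.

```shell
# render_boot_ipxe BASE_URL CMDLINE [EXTRA_ARGS]
# Prints an iPXE script to stdout. The asset file names are assumptions.
render_boot_ipxe() {
  base_url=$1
  cmdline=$2
  extra=${3:-}
  printf '#!ipxe\n'
  # Factory cmdline first, then any site-specific args (e.g. a serial console).
  printf 'kernel %s/kernel-amd64 %s%s\n' "$base_url" "$cmdline" "${extra:+ $extra}"
  printf 'initrd %s/initramfs-amd64.xz\n' "$base_url"
  printf 'boot\n'
}

# Usage sketch (schematic ID and NAS host are left as placeholders):
#   cmdline=$(curl -fsSL "https://factory.talos.dev/image/<id>/v1.13.2/cmdline-metal-amd64")
#   render_boot_ipxe "http://<nas>:8080/talos" "$cmdline" "console=ttyS0" > boot.ipxe
```

Re-running it after a Talos version bump picks up whatever parameters the new release requires without anyone retyping them.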

2. --force rotates your CA

This is the one that cost the most, and it was entirely self-inflicted.

talosctl gen config generates the cluster PKI. I applied a config to the node, then tweaked the config and re-ran talosctl gen config ... --force, twice. Without --with-secrets, every gen config run mints a brand-new CA; --force is simply what let each run overwrite the previous files without complaint. The node was now running a config trusting CA #1; my talosconfig was signed by CA #3. They will never speak again:

x509: certificate signed by unknown authority (… "talos")

I had locked myself out of the node I’d just built.

The fix is also the correct long-term practice, and it’s why mature Talos setups SOPS-encrypt one secrets bundle: generate the secrets once, persist them, and always pass them in.

talosctl gen secrets -o secrets.yaml          # once, ever
talosctl gen config <cluster> <endpoint> --with-secrets secrets.yaml ...
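
To make "once, ever" hard to get wrong, the two commands can sit behind a guard that refuses to regenerate an existing bundle. A minimal sketch: gen_config_stable is a hypothetical wrapper name, and secrets.yaml in the current directory is an assumed default.

```shell
# gen_config_stable CLUSTER ENDPOINT [SECRETS_FILE]
# Hypothetical wrapper: regenerates machine configs without ever rotating the CA.
gen_config_stable() {
  cluster=$1
  endpoint=$2
  secrets=${3:-secrets.yaml}
  # Create the PKI bundle only if it is missing or empty; never overwrite it.
  if [ ! -s "$secrets" ]; then
    talosctl gen secrets -o "$secrets" || return 1
  fi
  # With --with-secrets, --force only overwrites the rendered config files.
  talosctl gen config "$cluster" "$endpoint" --with-secrets "$secrets" --force
}
```

With this in place, tweaking and regenerating the config as often as needed leaves the node and the talosconfig on the same CA.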

3. The deadlock: interface names lie

This is the marquee lesson and the mistake I’ll never make again.

My machine config pinned the node’s static IP to interface: enp88s0 — the name the boot NIC had during the maintenance-mode discovery. After the config was applied and the node rebooted, that same physical NIC (same MAC) came back as enp87s0. Predictable interface names are derived from PCI enumeration and they are not guaranteed stable across boots.

Watch the cascade this kicked off:

  1. Static .101 config bound to enp88s0 — now a different, down NIC.
  2. The live NIC (enp87s0) had no config, fell back to DHCP, no proper default route or DNS.
  3. No DNS/route → NTP can’t reach a time server → the clock is wrong.
  4. Wrong clock → TLS certificate validity checks fail → every API call: tls: certificate required / expired certificate.
  5. API unreachable → I can’t fix the networking that’s causing all of it.

A genuine chicken-and-egg deadlock, and the root cause was one word: interface:. The fix is to never bind by name — bind by hardware address:

machine:
  network:
    interfaces:
      - deviceSelector:
          hardwareAddr: "38:05:25:37:a1:67"
        dhcp: false
        addresses: ["192.168.87.101/24"]
        routes:
          - { network: 0.0.0.0/0, gateway: 192.168.87.1 }
        vip:
          ip: 192.168.87.100

Once the selector was MAC-based, every downstream failure — clock, certs, etcd — cleared on its own. They were all symptoms of that one line.
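
A cheap guard against regressing is a pre-apply check that rejects any machine config still selecting a NIC by name. This is a crude text match rather than a YAML parser, and check_no_name_binding is a made-up helper; it assumes the config uses the interfaces list format shown above.

```shell
# check_no_name_binding FILE
# Returns 1 (and prints the offending lines) if FILE binds a NIC by kernel name.
check_no_name_binding() {
  file=$1
  # Matches "interface: enp88s0" list entries, but not the "interfaces:" key.
  if grep -nE '^[[:space:]]*(-[[:space:]]+)?interface:' "$file"; then
    echo "error: $file selects a NIC by name; use deviceSelector.hardwareAddr" >&2
    return 1
  fi
  return 0
}

# Usage sketch (controlplane.yaml is an assumed file name):
#   check_no_name_binding controlplane.yaml && echo "ok to apply"
```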

4. The escape hatch — and the whole point

To break the deadlock I needed the node in maintenance mode (whose API is insecure, so it doesn’t care about the broken clock). But Talos, when it netboots, finds the existing install’s config on disk and boots that — not maintenance. The only way back was to wipe the disk, and I had no working credentials to do it cleanly.

Talos has a no-credentials escape: the talos.experimental.wipe=system kernel parameter. Add it to the boot script, power-cycle, the node wipes its system disk and comes back into clean maintenance. Reapply the corrected config. Done.
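
Since the parameter stays in effect for as long as it is in the boot script, it helps to make it a switch that is as easy to turn off as on. A minimal sketch, assuming boot.ipxe has a single line starting with "kernel "; set_wipe is a hypothetical helper, not something from the repo.

```shell
WIPE_ARG='talos.experimental.wipe=system'
WIPE_RE='talos\.experimental\.wipe=system'

# set_wipe on|off FILE
# Adds or removes the wipe parameter on the kernel line of an iPXE script.
set_wipe() {
  mode=$1
  file=$2
  tmp="$file.tmp.$$"
  # Strip any existing copy first so repeated calls never stack the argument.
  sed "s/ $WIPE_RE//g" "$file" > "$tmp" || return 1
  if [ "$mode" = on ]; then
    # Append the parameter to the end of the kernel line only.
    sed "/^kernel /s/\$/ $WIPE_ARG/" "$tmp" > "$file" && rm -f "$tmp"
  else
    mv "$tmp" "$file"
  fi
}

# Usage sketch: set_wipe on boot.ipxe, power-cycle the node, then set_wipe off boot.ipxe
```

Turn it off again as soon as the node is back in maintenance mode; otherwise the next netboot would presumably wipe the freshly installed system too.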

And that is the actual thesis of this whole exercise. I had thoroughly bricked my access to a freshly-built node — orphaned PKI, deadlocked clock, locked out. Recovery was: change one line in a file on the NAS, power-cycle, wait ten minutes. The property I was building — every node reprovisionable from the network — is the same property that made a self-inflicted disaster a non-event. Cattle, not pets, is not a slogan; it’s the thing that saves you at 11pm.

(Bonus peril, for flavor: one of my recovery scripts stored an ssh command in a shell variable and called $VAR args. Works in bash. The harness shell was zsh, which doesn’t word-split unquoted variables, so every call was command not found and the script silently did nothing. Know your shell. This caused me genuine anguish and pain…)
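
The portable fix is to stop storing commands in strings at all. A function behaves the same in sh, bash, and zsh because its arguments are never re-split. A minimal sketch: the function name remote, the REMOTE_HOST variable, and the ssh options are all illustrative.

```shell
# Fragile: works only where unquoted variables are word-split (bash, sh).
#   REMOTE="ssh -o BatchMode=yes admin@nas"
#   $REMOTE uptime        # zsh: command not found: ssh -o BatchMode=yes admin@nas

# Portable: wrap the command in a function and forward arguments with "$@".
remote() {
  ssh -o BatchMode=yes "$REMOTE_HOST" "$@"
}

# Usage sketch:
#   REMOTE_HOST=admin@nas
#   remote uptime
```

Running scripts with set -e (or checking exit codes) would also have turned the silent no-op into a loud failure.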


Where it landed

$ kubectl get nodes
NAME             STATUS   ROLES           VERSION
home-control-1   Ready    control-plane   v1.36.0

Talos v1.13.2, Kubernetes v1.36.0, etcd bootstrapped, Cilium 1.18.0 as the CNI with kubeProxyReplacement=true pointed at Talos KubePrism (localhost:7445) — kube-proxy is disabled entirely. The control-plane VIP is live. One MS-01, netbooted end-to-end from the NAS, zero installer media ever touched.
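
For reference, the Cilium side of that setup could be expressed with Helm roughly as follows. This is a sketch under assumptions, not the command used for this cluster: it presumes the cilium Helm repo is already added, that the Talos machine config sets the CNI to none and disables kube-proxy, and it leaves out further values the Talos documentation lists (such as container capabilities). install_cilium is a made-up wrapper name.

```shell
# Hypothetical wrapper around a Helm install of Cilium for a Talos cluster.
install_cilium() {
  # k8sServiceHost/Port point the agent at KubePrism on each node,
  # so Cilium can reach the API server without kube-proxy.
  helm install cilium cilium/cilium \
    --version 1.18.0 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445 \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup
}
```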


What’s next

This is Part 0. The series from here:

  • Part 1 — the rebuild loop. talosctl reset and watch a node wipe, netboot, and rejoin with zero hands. Proving the cattle property on purpose instead of by accident.
  • GitOps. Flux owning infrastructure/, Argo CD owning apps/, exactly one reconciler per resource. SOPS-age for the secrets bundle from peril #2.
  • Observability and policy. kube-prometheus-stack, Loki, Tempo, OpenTelemetry; SLOs as code; Kyverno for policy.
  • HA. Nodes 2 and 3 — real three-node etcd quorum, the VIP doing its actual job.
  • The Jetson. Folding my private AI server in as an external GPU inference appliance — not as a cluster node (Talos has no Tegra support), but integrated via a selector-less Service so the cluster consumes inference as a normal in-cluster endpoint.

The cluster works. The interesting part — making it a platform — starts in Part 1.
