The Tailscale saga (Part 3): one ACL, four broken things

Author: Mateus Harrington

Published: May 23, 2026

This is part 3 of a four-part series. Part 1 brought up the cluster. Part 2 put it under GitOps. This is the part where I tried to give the cluster network access to the rest of the homelab, and accidentally rewrote my Tailscale ACL three times in one evening.

If you only read one part, read this one. The other parts could have been blog posts; this one was an experience.

Diagram placeholder: a network diagram showing the Tailnet as a cloud, with tag:truenas, tag:k8s-node (×3 Talos nodes), tag:k8s (×1 Grafana Ingress), and tag:family-immich (×1 Immich sidecar) as distinct devices on it. Arrows showing the ACL grants between them.

The starting point

Before any of this, my Tailscale model was:

  • TrueNAS host owned by my user account, visible in the tailnet as truenas.
  • All my personal devices owned by my user account.
  • A handful of family members invited as users, with their devices also on the tailnet.
  • A single ACL that effectively said “members can reach everything members own.”

To share Immich with family, I’d used Tailscale’s node sharing to share the TrueNAS host with each family member. This worked, but it shared everything on the TrueNAS host — including the AdGuard admin page, the *arr stack, the Jellyfin server. The ACL pretended that wasn’t true, but it was a fiction maintained by trust rather than enforcement.

I wanted three things to be true at once:

  1. Kubernetes nodes should be able to mount NFS from TrueNAS.
  2. Family members should be able to reach only Immich — ideally without me having to maintain per-port ACL rules every time I added a service.
  3. Future-me should be able to expose Kubernetes-hosted services (Grafana, eventually Home Assistant) over Tailscale, with sane per-service URLs.

Achieving all three of those at the same time turned out to need four separate fixes. None of them was hard on its own. The combination ate an evening.

Stumbling block 1: the tag split

The first thing I tried was: “right, I’ll put a tag on the Talos nodes, give the operator a tag for its Services, and write ACL grants between them.” Easy.

It was not easy. Here’s where I went wrong.

The Tailscale Kubernetes Operator uses an OAuth client to provision Tailnet devices for each Ingress you create. The tag those devices get is set by proxyConfig.defaultTags (in Helm values) or by per-Ingress annotations. I’d set this to tag:k8s and, as far as I knew, the OAuth client’s authorised tags also included tag:k8s. Fine so far.

Meanwhile the Talos nodes themselves join the tailnet via the siderolabs/tailscale system extension. Each node runs its own tailscaled and advertises tags via TS_EXTRA_ARGS. I’d originally set this to --advertise-tags=tag:k8s too, on the grounds that “they’re all part of the same cluster, right?”

This is wrong for two reasons:

  • Talos nodes need to make outbound calls to the TrueNAS host for NFS. Operator-managed Services don’t.
  • Operator-managed Services need to be reachable inbound from my personal devices. Talos nodes mostly don’t (and shouldn’t — the kubelet isn’t a thing I want exposed).

Lumping them into one tag means the ACL has to grant the union of both directions to both groups. Splitting them lets each get exactly what it needs.

The fix was straightforward once I’d figured it out: rename the nodes’ tag to tag:k8s-node, register it in tagOwners in the ACL, and update the extension config:

# proxmox/tailscale-extension.yaml
apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: tailscale
environment:
  - TS_AUTHKEY=tskey-auth-...
  - TS_EXTRA_ARGS=--advertise-tags=tag:k8s-node --accept-dns=false

(The --accept-dns=false is there because the nodes have their own DNS setup and I don’t want tailscaled overwriting it.)

The grants then become small and obvious:

// Cluster nodes can mount NFS from TrueNAS
{ "src": ["tag:k8s-node"], "dst": ["tag:truenas"],
  "ip":  ["tcp:2049"] },
// My personal devices can reach operator-managed services
{ "src": ["autogroup:member"], "dst": ["tag:k8s"],
  "ip":  ["*"] }
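
To make the effect of the split concrete, here is a toy Python model of how grants like these two are matched. It is only a sketch: the allowed helper and its exact-match rules are my own simplification, not Tailscale's policy engine, and autogroup:member is treated as just another label.

```python
# Toy model of the two grants above. This is NOT Tailscale's policy engine;
# it only matches exact src/dst labels and "proto:port" strings (or "*").
GRANTS = [
    {"src": ["tag:k8s-node"], "dst": ["tag:truenas"], "ip": ["tcp:2049"]},
    {"src": ["autogroup:member"], "dst": ["tag:k8s"], "ip": ["*"]},
]


def allowed(src: str, dst: str, proto_port: str, grants=GRANTS) -> bool:
    """Return True if any grant permits src -> dst on proto_port."""
    for grant in grants:
        if src in grant["src"] and dst in grant["dst"]:
            if "*" in grant["ip"] or proto_port in grant["ip"]:
                return True
    return False


if __name__ == "__main__":
    # Nodes can mount NFS from TrueNAS...
    print(allowed("tag:k8s-node", "tag:truenas", "tcp:2049"))
    # ...but operator-managed devices cannot, and nothing grants
    # inbound access to the nodes themselves (10250 is the kubelet port).
    print(allowed("tag:k8s", "tag:truenas", "tcp:2049"))
    print(allowed("autogroup:member", "tag:k8s-node", "tcp:10250"))
```

With a single shared tag, the first grant would also have let every operator-managed device reach NFS, and the second would have exposed the nodes to every member device.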

I also hit a smaller version of the same problem in the operator Helm values: defaultTags was set to tag:k8s but the OAuth client wasn’t actually authorised to request that tag (an admin mistake I’d made earlier). The operator would create an Ingress device, fail to tag it, and the device would end up untagged and unreachable. Two-line fix in the ACL once I noticed the operator logs were complaining about it. Should have read the logs sooner.
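
A related slip that is easy to make when renaming tags is referencing a tag in a grant without declaring it in tagOwners. The Python sketch below is a small lint for that, under the assumption that the policy file has already been parsed into a dict (the real file is HuJSON, so comments would need stripping first). The undeclared_tags helper and the owner entries are hypothetical.

```python
def undeclared_tags(policy: dict) -> set:
    """Return tags used in grants that are not declared in tagOwners."""
    declared = set(policy.get("tagOwners", {}))
    used = set()
    for grant in policy.get("grants", []):
        for label in grant.get("src", []) + grant.get("dst", []):
            if label.startswith("tag:"):
                used.add(label)
    return used - declared


if __name__ == "__main__":
    # Hypothetical, already-parsed policy; the owner entries are placeholders.
    policy = {
        "tagOwners": {
            "tag:truenas": ["autogroup:admin"],
            "tag:k8s": ["autogroup:admin"],
        },
        "grants": [
            {"src": ["tag:k8s-node"], "dst": ["tag:truenas"], "ip": ["tcp:2049"]},
            {"src": ["autogroup:member"], "dst": ["tag:k8s"], "ip": ["*"]},
        ],
    }
    print(undeclared_tags(policy))  # {'tag:k8s-node'}
```

This only checks that each tag is declared. It says nothing about whether a given OAuth client is allowed to request it, which is what the operator logs were complaining about.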

Stumbling block 2: kubelet picked the wrong nodeIP

Right after rolling out the siderolabs/tailscale extension, the cluster broke. flux get all -A started returning timeouts. kubectl logs from my laptop took thirty seconds and then failed.

Here’s what had happened. When tailscale0 came up on each Talos node, kubelet picked the new interface’s IP — a 100.x.y.z Tailscale address — as its nodeIP. Kubernetes node-to-node traffic then tried to route over the tailnet, including via DERP relays when a direct connection wasn’t available. Everything became extremely slow.

The fix is a one-file patch:

# proxmox/patch-node-ip.yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 192.168.1.0/24

…applied with talosctl patch machineconfig, followed by a reboot of each node. Now kubelet picks the LAN IP, node-to-node traffic stays on the LAN, and the tailnet is purely an egress path for NFS and a few other things.
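
As a rough illustration of what validSubnets buys you, the Python sketch below filters a node's candidate addresses against the allowed subnets. It is not Talos's actual selection code, and both addresses are made up: one sits in Tailscale's 100.64.0.0/10 range, the other on the LAN.

```python
import ipaddress


def pick_node_ip(candidates, valid_subnets):
    """Return the first candidate address inside any valid subnet, or None."""
    networks = [ipaddress.ip_network(s) for s in valid_subnets]
    for addr in candidates:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in networks):
            return addr
    return None


if __name__ == "__main__":
    # Hypothetical addresses: tailscale0 first, then the LAN interface.
    addrs = ["100.101.102.103", "192.168.1.21"]
    print(pick_node_ip(addrs, ["192.168.1.0/24"]))  # 192.168.1.21
```

Without the filter, whichever address happens to come first wins, which is how the tailnet address ended up as the nodeIP.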

This one is now part of my baseline machine config, so it only ever needs to be applied again if I add a new node or rebuild an existing one. The full procedure is in docs/talos-extensions-rollout.md under “Step 6a”.

The lesson here is one I keep relearning: when you add an interface to a host, something somewhere will probably try to use it for the wrong purpose by default. Always check what kubelet (and routing tables, and /etc/resolv.conf) think the new interface is for.

Stumbling block 3: the democratic-csi controller couldn’t reach TrueNAS

With nodes on the tailnet and ACL grants in place, NFS mounts from the kubelet worked fine. I deployed democratic-csi as the NFS provisioner, pointed it at TrueNAS, and watched it fail to create a single PV.

The issue: democratic-csi has two parts.

  • Node pods, which run on each Talos node and do the actual mount syscalls. These were fine — kubelet is on the host network, the host can route to TrueNAS over tailscale0.
  • Controller pod, which talks to the TrueNAS HTTPS API to provision and destroy datasets. The controller was running in the cluster’s pod network, which doesn’t have a route to the tailnet.

The fix I ended up with has two parts:

  1. controller.hostNetwork: true in the democratic-csi Helm values. This puts the controller pod in the node’s network namespace, so it can use tailscale0 directly.

  2. An ExternalName Service annotated for the Tailscale operator, so the driver config doesn’t need to bake in a raw tailnet IP:

    apiVersion: v1
    kind: Service
    metadata:
      name: truenas-tailscale
      namespace: democratic-csi
      annotations:
        tailscale.com/tailnet-ip: "100.x.y.z"
    spec:
      type: ExternalName
      externalName: placeholder  # operator overwrites this
      ports:
        - { name: https, port: 444, protocol: TCP }

    The Tailscale operator notices the annotation, sets up an egress proxy for that tailnet IP, and the driver config points at truenas-tailscale.democratic-csi.svc.cluster.local.

Strictly, hostNetwork on its own would have been enough. The ExternalName Service is there so that if the TrueNAS tailnet IP ever changes, I update one annotation rather than re-encrypting the SOPS-encrypted driver config.

This took me embarrassingly long to debug, because the failure mode was “the controller pod is Running and Ready, but no PVs are ever created and the events tab is silent.” The clue is in kubectl logs on the controller pod — but only if you grep specifically for “TrueNAS” or “freenas-api”, since the actual error is a connection timeout buried in a stack trace.
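
Because the events were silent, a direct reachability probe would have found this faster than the logs did. Below is a minimal Python sketch of one, assuming a Python interpreter is available in whatever network context you want to test (a debug pod, or the host). The can_connect helper is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

Calling can_connect("truenas-tailscale.democratic-csi.svc.cluster.local", 444) from the controller's network context would tell you whether the API port is reachable at all. A True result only proves TCP reachability, not that the API credentials work.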

Stumbling block 4: Immich needed to leave home

Solving the family-sharing problem turned out to be the thing that forced the Immich migration, not the other way round.

Here’s the chain of reasoning:

  • I’d just moved TrueNAS to tag:truenas (so the kubelet ACL grant could be tag-based, not user-based).
  • Tagged machines cannot be node-shared in Tailscale. (This is documented behaviour — sharing is at the user level; tags replace user ownership.)
  • That meant my family members lost access to Immich, because the TrueNAS host was no longer share-able.

The cleanest fix was to give Immich its own Tailnet device. To do that, it needed to leave the TrueNAS apps system (which only has one tailscaled, owned by the host) and become a Docker Compose stack with a tailscale/tailscale sidecar in the same Compose network. The sidecar would join the tailnet under tag:family-immich, run tailscale serve to terminate HTTPS at the MagicDNS hostname, and proxy traffic to the immich-server container.

The Compose looks roughly like this (full version in the repo):

services:
  immich-server:
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # No `ports:` — access is Tailnet-only via the sidecar.
    ...

  immich-family-ts:
    image: tailscale/tailscale:latest
    hostname: immich-family
    environment:
      - TS_AUTHKEY=${TS_AUTHKEY}
      - TS_STATE_DIR=/var/lib/tailscale
      - TS_SERVE_CONFIG=/config/serve.json
      - TS_USERSPACE=true
    volumes:
      - /mnt/HDDs/immich/family-ts:/var/lib/tailscale
    configs:
      - source: immich-family-ts-serve
        target: /config/serve.json

configs:
  immich-family-ts-serve:
    content: |
      { "TCP": { "443": { "HTTPS": true } },
        "Web": { "$${TS_CERT_DOMAIN}:443": {
          "Handlers": { "/": { "Proxy": "http://immich-server:2283" } }
        } },
        "AllowFunnel": { "$${TS_CERT_DOMAIN}:443": false } }
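
The double dollar sign in the serve config is easy to misread. Compose treats $$ as an escaped dollar sign, so the container receives a literal ${TS_CERT_DOMAIN}, which the tailscale image then fills in at startup. The Python below is a toy simulation of those two passes, not the real Compose or container code, and the domain is a made-up example.

```python
import re


def compose_pass(text: str, env: dict) -> str:
    """Roughly what Compose does: expand ${NAME}, and turn $$ into $."""
    def repl(match):
        if match.group(0) == "$$":
            return "$"
        return env.get(match.group(1), "")
    return re.sub(r"\$\$|\$\{(\w+)\}", repl, text)


def container_pass(text: str, cert_domain: str) -> str:
    """Roughly what the tailscale container does with the serve config."""
    return text.replace("${TS_CERT_DOMAIN}", cert_domain)


if __name__ == "__main__":
    line = '"$${TS_CERT_DOMAIN}:443"'
    after_compose = compose_pass(line, env={})
    print(after_compose)  # "${TS_CERT_DOMAIN}:443"
    # Hypothetical MagicDNS name, for illustration only.
    print(container_pass(after_compose, "immich-family.example.ts.net"))
```

With a single dollar sign, the first pass would try to expand the variable on the Docker host, where it is not set, and the sidecar would end up with an empty hostname in its config.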

Two things to call out:

  • network_mode: service: is a trap here. The “obvious” way to set up a tailscale sidecar is to put the app in the sidecar’s network namespace (network_mode: service:immich-family-ts). That works for single-container apps. Immich isn’t one — it needs to resolve database and redis as service-name DNS, which doesn’t work from inside the sidecar’s namespace because the sidecar runs in user-space and doesn’t have Docker’s embedded DNS. The fix is to leave both the app and the sidecar on the Compose network and have the sidecar’s serve.json reach the app by container name. I learned this by spending about an hour staring at “redis: name does not resolve” errors.
  • Family members get shared just this device. The Tailscale share UI lets me share immich-family with each family member’s account. They see one device in their tailnet called immich-family.<my-tailnet>.ts.net, and that’s it. No AdGuard, no Jellyfin, no TrueNAS UI.

While I was at it I migrated the ImmichFrame slideshow to the same model: its own sidecar, its own Tailnet device, shared to the same family members. That one does use network_mode: service:, because ImmichFrame is a single container and doesn’t need to resolve other Compose services by name.

Screenshot placeholder: the Tailscale admin console showing the new model — truenas (tagged), talos-cp1, talos-w1, talos-w2 (tagged k8s-node), grafana (tagged k8s), immich-family (tagged family-immich), my laptop, and a couple of family members’ phones in the shared column.

The Postgres-18 booby trap

The migration came with one final surprise. The upstream Immich Compose template mounts the database volume at /var/lib/postgresql/data. That’s correct for Postgres 14, which is what their template targets.

My inherited data, from the previous TrueNAS Immich app, was on Postgres 18. The official Postgres Docker image changed its volume layout for version 18: the mount point is expected to be /var/lib/postgresql, with version-numbered subdirectories underneath (e.g. /var/lib/postgresql/18/). Mounting the existing data at /var/lib/postgresql/data produced a “this looks like an empty data directory” message and Postgres helpfully initdb-ed a fresh one, which would have nuked my photos if I’d let it run.

The fix is a one-line change in the Compose:

volumes:
  - ${DB_DATA_LOCATION}:/var/lib/postgresql  # NOT /data

…and a comment block above it that’s about ten times longer than the line itself, because future-me will absolutely forget.
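
One way to check which layout an inherited directory uses before starting the container is to look for where PG_VERSION lives. The Python sketch below does that, treating any directory with a PG_VERSION file and a base/ subdirectory as a Postgres data directory. The function names and the depth limit are my own choices, not anything from the Immich or Postgres tooling.

```python
from pathlib import Path


def is_cluster_dir(path: Path) -> bool:
    """A Postgres data directory has a PG_VERSION file and a base/ directory."""
    return (path / "PG_VERSION").is_file() and (path / "base").is_dir()


def find_clusters(root: str, max_depth: int = 2):
    """List (relative path, major version) for data directories under root."""
    base = Path(root)
    found = []
    stack = [(base, 0)]
    while stack:
        current, depth = stack.pop()
        if is_cluster_dir(current):
            version = (current / "PG_VERSION").read_text().strip()
            found.append((str(current.relative_to(base)), version))
            continue  # do not descend into the data directory itself
        if depth < max_depth:
            stack.extend((p, depth + 1) for p in current.iterdir() if p.is_dir())
    return sorted(found)
```

Run against the host path behind DB_DATA_LOCATION, a result of "." means the directory is itself the data directory (the older layout), while a nested, version-numbered path points to the newer layout and the parent mount shown above.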

The Docker Hub PR that documents the change is here if you want the upstream rationale. The TL;DR is “always check the data-directory layout when bumping a major Postgres version, even if you didn’t think you were bumping it.”

What you have at the end of Part 3

  • A Tailscale tag model that’s actually load-bearing, with each tag granted exactly what it needs.
  • A Kubernetes cluster whose kubelet talks to the LAN, and only the LAN, for node-to-node traffic.
  • A democratic-csi install that can both mount NFS volumes (kubelet) and talk to the TrueNAS API (controller) over the tailnet.
  • An Immich (and ImmichFrame) deployment that lives as its own Tailnet device, shareable to family without exposing the rest of the homelab.

This took longer to write than I would like to admit. It also took longer to do than I would like to admit. The thing about a homelab is that nobody is paying you for it, so the only incentive to actually finish the writeup is the suspicion that future-you will need it. Future-me almost certainly will.

Onward to Part 4, the actual reward.
