Clusters and nodes
This is the operator view: how to target a cluster, shape it, add and remove nodes, and what runs underneath. Verify any command with --help:
runos clusters --help
runos nodes --help
runos integrations --help
Targeting a cluster (cid)
A cluster is identified by a cid: 3 to 16 lowercase alphanumeric characters. Live examples: ky3, xym. Every cluster belongs to an account aid (example: rjwrn).
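The cid format above is easy to check locally before passing it to --cid. This is a minimal sketch: valid_cid is a hypothetical helper, not part of the runos CLI, and it checks only the format stated above (3 to 16 lowercase alphanumeric characters), not whether the cluster exists.

```shell
# Hypothetical helper: succeeds only if $1 is 3 to 16 lowercase ASCII
# alphanumeric characters. Runs in a subshell so LC_ALL=C stays local.
valid_cid() (
  LC_ALL=C
  case "$1" in
    *[!a-z0-9]*) exit 1 ;;   # any character outside a-z, 0-9 is rejected
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 16 ]
)

for cid in ky3 xym KY3 ab; do
  if valid_cid "$cid"; then echo "$cid: valid"; else echo "$cid: invalid"; fi
done
```

Here ky3 and xym pass, KY3 fails on case, and ab fails on length.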
runos clusters add allocates the cid. The cluster becomes a real Kubernetes cluster when the first node joins and bootstraps the control plane. After that, runos clusters configure installs the base infrastructure.
List your clusters:
runos clusters list
# CID NAME ISCONFIGURED
# ky3 tc3-ky3 true
# xym tc2-xym false
isConfigured is true once the base stack is installed. Most commands take --cid <cid>. Set a default so you can drop the flag:
runos config set cid ky3
Cluster shape (roles, node types, HA)
A node carries one or both roles:
- control-plane: runs the Kubernetes API server and etcd.
- worker: runs your workloads.
A node can be both: a single-node cluster has one node that is isCp: true and isWorker: true.
Nodes come in two types:
- manual: your own server, joined with a generated command.
- cloud-provisioned: created by RunOS through a compute integration. The CLI provisions on hetzner, digitalocean, and scaleway. (hyperstack and lambdacloud are compute-capable integrations with no provisioning command, so bring those as manual nodes.)
One node is not high availability. HA needs an odd number of control-plane nodes so etcd keeps quorum; 3 is the standard count.
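The odd-count rule follows from majority quorum: a cluster of n etcd voting members needs n/2 + 1 (integer division) of them alive, so it tolerates n minus that many failures. The arithmetic can be sketched as below; etcd_quorum and etcd_tolerated are illustrative names, and the sketch assumes every control-plane node is one voting member.

```shell
# Majority quorum for n voting members (integer division).
etcd_quorum() { echo $(( $1 / 2 + 1 )); }
# Members that can fail while quorum still holds.
etcd_tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_tolerated "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Four members tolerate the same single failure as three, which is why an even count adds cost without adding resilience.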
Inspect the shape:
runos nodes list --cid ky3
runos nodes show <nid> --cid ky3
Add a node
Cloud-provision a node (the provider needs an integration first, see runos integrations list):
runos integrations hetzner add-server --cid ky3 \
--integration-id <id> --request-as-worker
Or join a server you already own:
runos nodes join-command ssh-remote --cid ky3 --request-as-worker
The join token is single-use and short-lived (the response carries an expiresAt). Run the command once, on the target host.
Each node must clear preflight before it installs. The floor:
- CPU: at least 2.
- RAM: at least 3.5 GB.
- cgroup v2, with the cpu, memory, and pids controllers.
- OS: Ubuntu 22.04, 24.04, or 26.04 (interim releases like 25.04 are rejected).
- Disk: at least 15 GiB free on the volume backing /var/lib/containerd (RunOS recommends 20 GB+ free on /var; containerd images and etcd fill it fast).
- Egress: the node must reach the Kubernetes registries pkgs.k8s.io and registry.k8s.io.
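To check a server against the preflight floor before generating a join command, the thresholds above can be sketched as a small function. This is not the RunOS preflight itself: preflight_check is a hypothetical helper, it takes its inputs as arguments, it treats 3.5 GB as 3584 MiB (an assumption about units), and it skips the egress check.

```shell
# Hypothetical helper, not the RunOS preflight.
# Usage: preflight_check CPUS MEM_MIB FREE_MIB "CGROUP_CONTROLLERS" UBUNTU_VERSION
preflight_check() {
  rc=0
  [ "$1" -ge 2 ]     || { echo "fail: need at least 2 CPUs, found $1"; rc=1; }
  [ "$2" -ge 3584 ]  || { echo "fail: need at least 3584 MiB RAM, found $2"; rc=1; }
  [ "$3" -ge 15360 ] || { echo "fail: need at least 15 GiB free, found $3 MiB"; rc=1; }
  # Match whole words so that "cpuset" does not count as "cpu".
  for c in cpu memory pids; do
    case " $4 " in
      *" $c "*) ;;
      *) echo "fail: cgroup v2 controller '$c' not enabled"; rc=1 ;;
    esac
  done
  case "$5" in
    22.04|24.04|26.04) ;;
    *) echo "fail: unsupported Ubuntu release '$5'"; rc=1 ;;
  esac
  return "$rc"
}

# On a Linux host the inputs could be gathered like this (not run here;
# /var/lib/containerd may not exist before the first install):
#   preflight_check "$(nproc)" \
#     "$(awk '/^MemTotal:/ {print int($2/1024)}' /proc/meminfo)" \
#     "$(df -Pm /var/lib/containerd | awk 'NR==2 {print $4}')" \
#     "$(cat /sys/fs/cgroup/cgroup.controllers)" \
#     "$(. /etc/os-release && echo "$VERSION_ID")"
```

The function prints one line per failed check and returns non-zero if any check fails, so it can gate a provisioning script.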
A node's status is ready while its agent heartbeats. Miss heartbeats for 30 seconds and it flips to no_heartbeat.
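The heartbeat rule can be sketched as a comparison of two timestamps. node_status is a hypothetical helper, and treating exactly 30 seconds of silence as no_heartbeat is an assumption about the boundary; the real check runs on the RunOS side.

```shell
# Hypothetical helper. Arguments are Unix timestamps in seconds:
#   $1 = time of the last heartbeat, $2 = current time.
node_status() {
  if [ $(( $2 - $1 )) -ge 30 ]; then
    echo no_heartbeat
  else
    echo ready
  fi
}

node_status 1000 1010   # ready
node_status 1000 1030   # no_heartbeat
```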
Remove a node
Removal is driven by the control plane: it drains the node (cordon, then evict its pods) and, for a control-plane node, removes its etcd member.
runos nodes delete <nid> --cid ky3
Two things to know:
- The cloud server is not deleted by default. You keep getting billed until you pass --delete-cloud-instance, or delete it at the provider.
- You cannot remove the last control-plane node normally. The delete is refused unless you also reset the cluster (--reset-cluster), which is irreversible.
runos nodes delete-preflight warns if a removal would leave etcd on an even voting-member count. It is advisory and never blocks the delete.
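The two control-plane rules above (the last node is refused, an even remainder is warned about) can be sketched as one decision. removal_advice is a hypothetical helper, not the logic inside runos nodes delete-preflight, and it assumes each control-plane node is exactly one etcd voting member.

```shell
# Hypothetical helper. $1 = control-plane nodes before the removal.
removal_advice() {
  remaining=$(( $1 - 1 ))
  if [ "$remaining" -le 0 ]; then
    echo "refused: last control-plane node, requires --reset-cluster"
  elif [ $(( remaining % 2 )) -eq 0 ]; then
    echo "warn: $remaining etcd voting members left (even count)"
  else
    echo "ok: $remaining etcd voting members left (odd count)"
  fi
}

removal_advice 1   # refused: last control-plane node, requires --reset-cluster
removal_advice 3   # warn: 2 etcd voting members left (even count)
removal_advice 4   # ok: 3 etcd voting members left (odd count)
```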
Kubernetes version
The Kubernetes version is set per cluster and is currently 1.35. Upgrades run node by node: each node is cordoned, drained, upgraded, and uncordoned, so the cluster stays up.
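To make the per-node order concrete, the sketch below prints the kubectl equivalent of one rolling pass. It only prints a plan and runs nothing. RunOS drives the upgrade itself, so this assumes nothing about how it does so internally; rolling_upgrade_plan and the node names are hypothetical.

```shell
# Hypothetical helper: prints the cordon, drain, upgrade, uncordon sequence
# for each node name given, one node at a time. Nothing is executed.
rolling_upgrade_plan() {
  for node in "$@"; do
    echo "kubectl cordon $node"
    echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
    echo "# upgrade $node (performed by RunOS; step not shown in this guide)"
    echo "kubectl uncordon $node"
  done
}

rolling_upgrade_plan node-a node-b
```

Because each node is uncordoned before the next is cordoned, at most one node is out of service at a time.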
Networking and ingress
The pod network is Cilium. Ingress is Traefik, run as a host-network DaemonSet on ports 80 and 443.
There is no cloud load balancer. The Traefik Service is type: ClusterIP, not type: LoadBalancer. Traffic reaches Traefik directly on each node's ports 80 and 443.
Control-plane high availability
The Kubernetes API is fronted by a control-plane VIP among the healthy control-plane nodes. Each node runs an on-node load balancer in front of the Kubernetes API, so the API stays reachable if a control-plane node goes down.
Connectivity (internal overlay vs user VPN)
Two separate network planes:
- Internal overlay: a private WireGuard mesh carries node-to-node cluster traffic. It is not user-reachable.
- User-access VPN: an opt-in tunnel (a separate wg1 interface) for operators to reach cluster-internal services. You pick one of WireGuard or NetBird; they are mutually exclusive.
The internal overlay always exists. The user VPN is something you turn on.
Cluster domains and certificates
Every cluster gets a built-in domain in the form cid.aid.<root>. Example: ky3.rjwrn.example.com.
Its wildcard TLS certificate is issued by cert-manager using a DNS-01 challenge. List and add domains:
runos cluster-domains list-by-cluster --cid ky3
runos cluster-domains add ... # custom wildcard via Cloudflare DNS-01
A custom domain provisions its own ClusterIssuer, wildcard cert, and wildcard DNS record.
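The built-in domain form cid.aid.<root> can be assembled in a script as sketched below. cluster_domain and wildcard_name are hypothetical helpers, and the assumption that the wildcard certificate covers *.cid.aid.<root> is an inference from the text, not something the CLI output confirms.

```shell
# Hypothetical helpers. Arguments: CID AID ROOT
cluster_domain() { printf '%s.%s.%s\n' "$1" "$2" "$3"; }
# Assumed wildcard name for the built-in domain.
wildcard_name() { printf '*.%s\n' "$(cluster_domain "$1" "$2" "$3")"; }

cluster_domain ky3 rjwrn example.com   # ky3.rjwrn.example.com
wildcard_name ky3 rjwrn example.com    # *.ky3.rjwrn.example.com
```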
Storage and durability
Storage is node-local by default, backed by OpenEBS (the openebs-local StorageClass is the default). Node-local means data lives on one node, so a single-node cluster has no cross-node redundancy.
For volumes that survive a node loss, add nodes and opt into LINSTOR distributed block storage (DRBD-backed, install-gated). On cloud nodes you can carve a raw data partition for it at provision time with --provider-config.repartition.