Clusters and nodes

This is the operator view: how to target a cluster, shape it, add and remove nodes, and what runs underneath. Verify any command with --help:

runos clusters --help
runos nodes --help
runos integrations --help

Targeting a cluster (cid)

A cluster is identified by a cid: 3 to 16 lowercase alphanumeric characters. Live examples: ky3, xym. Every cluster belongs to an account, which is identified by an aid (example: rjwrn).
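
The cid format can be checked locally before a value is passed to --cid. A minimal sketch, assuming POSIX sh and grep; is_valid_cid is a hypothetical helper, not a runos command, and it checks the format only:

```shell
# is_valid_cid CID -> exit status 0 if CID is 3 to 16 lowercase
# alphanumeric characters. Hypothetical helper: it validates the format
# only and says nothing about whether the cluster exists.
is_valid_cid() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z0-9]{3,16}'
}

for cid in ky3 xym KY3 ab; do
  if is_valid_cid "$cid"; then
    echo "$cid: valid"
  else
    echo "$cid: invalid"
  fi
done
```

ky3 and xym pass; KY3 fails on case and ab fails on length.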

runos clusters add allocates the cid. The cluster becomes a real Kubernetes cluster when the first node joins and bootstraps the control plane. After that, runos clusters configure installs the base infrastructure.

List your clusters:

runos clusters list
# CID NAME ISCONFIGURED
# ky3 tc3-ky3 true
# xym tc2-xym false

isConfigured is true once the base stack is installed. Most commands take --cid <cid>. Set a default so you can drop the flag:

runos config set cid ky3

Cluster shape (roles, node types, HA)

A node carries one or both roles:

  • control-plane: runs the Kubernetes API server and etcd.
  • worker: runs your workloads.

For example, a single-node cluster has one node with both isCp: true and isWorker: true.

Nodes come in two types:

  • manual: your own server, joined with a generated command.
  • cloud-provisioned: created by RunOS through a compute integration. The CLI provisions on hetzner, digitalocean, and scaleway. (hyperstack and lambdacloud are compute-capable integrations with no provisioning command, so bring those as manual nodes.)

One node is not high availability. HA needs at least three control-plane nodes, and an odd count, so etcd keeps quorum when one of them fails; 3 is the standard count.
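
The arithmetic behind the odd-count rule: etcd needs a majority of its voting members to agree, so an even count adds a member without adding fault tolerance. A small sketch of that calculation; etcd_quorum and etcd_fault_tolerance are hypothetical helper names, not runos or etcd commands:

```shell
# etcd_quorum N -> majority needed among N voting members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance N -> members that can fail while quorum holds.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
# members=1 quorum=1 tolerates=0
# members=2 quorum=2 tolerates=0
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
```

Going from 3 to 4 members raises the quorum but not the number of failures tolerated, which is why odd counts are preferred.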

Inspect the shape:

runos nodes list --cid ky3
runos nodes show <nid> --cid ky3

Add a node

Cloud-provision a node (the provider needs an integration first, see runos integrations list):

runos integrations hetzner add-server --cid ky3 \
--integration-id <id> --request-as-worker

Or join a server you already own:

runos nodes join-command ssh-remote --cid ky3 --request-as-worker

The join token is single-use and short-lived (the response carries an expiresAt). Run the command once, on the target host.

Each node must clear preflight before it installs. The floor:

  • At least 2 CPUs and 3.5 GB RAM.
  • cgroup v2 with the cpu, memory, and pids controllers.
  • Ubuntu 22.04, 24.04, or 26.04. Interim releases such as 25.04 are rejected.
  • At least 15 GiB free on the volume backing /var/lib/containerd. RunOS recommends 20 GB+ free on /var; containerd images and etcd fill it fast.
  • Egress to the Kubernetes registries pkgs.k8s.io and registry.k8s.io.
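
To see roughly where a host stands before you generate a join command, the floor can be approximated locally. This is a sketch, not the RunOS preflight: all function names are hypothetical, the 3.5 GB RAM floor is read as 3.5 GiB (an assumption), and registry egress is not checked.

```shell
# Local approximation of the documented preflight floor.
# Not the RunOS implementation; thresholds are taken from the text above.

# supported_ubuntu VERSION_ID -> 0 if the release is one the docs list.
supported_ubuntu() {
  case "$1" in
    22.04|24.04|26.04) return 0 ;;
    *) return 1 ;;
  esac
}

# has_required_controllers "LIST" -> 0 if cpu, memory and pids all appear
# in a space-separated controller list (the cgroup.controllers format).
has_required_controllers() {
  for c in cpu memory pids; do
    case " $1 " in
      *" $c "*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# meets_floor CPUS MEM_KIB FREE_KIB -> 0 if all three meet the floor.
# 3670016 KiB = 3.5 GiB (assumed reading of "3.5 GB"); 15728640 KiB = 15 GiB.
meets_floor() {
  [ "$1" -ge 2 ] && [ "$2" -ge 3670016 ] && [ "$3" -ge 15728640 ]
}

# preflight_local: gather the values from this host and report failures.
preflight_local() {
  cpus=$(nproc)
  mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  # /var/lib/containerd may not exist yet; walk up to the nearest parent.
  dir=/var/lib/containerd
  while [ ! -d "$dir" ]; do dir=$(dirname "$dir"); done
  free_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  # This file exists only on a cgroup v2 (unified) hierarchy.
  controllers=$(cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null)
  version=$(. /etc/os-release 2>/dev/null && echo "$VERSION_ID")

  rc=0
  supported_ubuntu "$version" || { echo "FAIL: release '$version'"; rc=1; }
  has_required_controllers "$controllers" || { echo "FAIL: cgroup v2 controllers '$controllers'"; rc=1; }
  meets_floor "$cpus" "$mem_kib" "$free_kib" || { echo "FAIL: cpus=$cpus mem_kib=$mem_kib free_kib=$free_kib"; rc=1; }
  return "$rc"
}
```

Run preflight_local on the target host; it prints one FAIL line per unmet requirement and returns non-zero if any check fails. The real preflight remains authoritative.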

A node's status is ready while its agent heartbeats. Miss heartbeats for 30 seconds and it flips to no_heartbeat.
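
The 30-second rule can be pictured as a comparison against the time of the last heartbeat. A sketch under assumptions: how the control plane actually tracks heartbeats is not documented here, node_status is a hypothetical helper, and treating exactly 30 seconds as missed is a guess at the boundary.

```shell
# node_status NOW LAST_HEARTBEAT (both epoch seconds)
# -> prints ready or no_heartbeat. Hypothetical helper.
HEARTBEAT_TIMEOUT=30   # seconds, from the text above

node_status() {
  if [ $(( $1 - $2 )) -ge "$HEARTBEAT_TIMEOUT" ]; then
    echo no_heartbeat
  else
    echo ready
  fi
}

node_status 1000 990   # 10 s since the last heartbeat: prints ready
node_status 1000 960   # 40 s since the last heartbeat: prints no_heartbeat
```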

Remove a node

Removal is driven by the control plane: it drains the node (cordon, then evict its pods) and, for a control-plane node, removes its etcd member.

runos nodes delete <nid> --cid ky3

Two things to know:

  • The cloud server is not deleted by default. You keep getting billed for it unless you pass --delete-cloud-instance with the delete, or remove the server at the provider.
  • You cannot remove the last control-plane node normally. The delete is refused unless you also reset the cluster (--reset-cluster), which is irreversible.

runos nodes delete-preflight warns if a removal would leave etcd on an even voting-member count. It is advisory and never blocks the delete.
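
The even-count condition that delete-preflight warns about comes down to simple arithmetic. A sketch of that arithmetic only, assuming the node being removed is a voting control-plane member; warn_if_even_after_removal is a hypothetical helper and its message is not real runos output:

```shell
# warn_if_even_after_removal CURRENT_VOTING_MEMBERS
# Prints a warning if removing one member leaves an even, non-zero count.
# Always returns 0: like the documented check, it is advisory only.
warn_if_even_after_removal() {
  remaining=$(( $1 - 1 ))
  if [ "$remaining" -gt 0 ] && [ $(( remaining % 2 )) -eq 0 ]; then
    echo "warning: removal leaves $remaining etcd voting members (even count)"
  fi
  return 0
}

warn_if_even_after_removal 3   # prints a warning: 2 members would remain
warn_if_even_after_removal 4   # prints nothing: 3 members would remain
```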

Kubernetes version

The Kubernetes version is set per cluster and is currently 1.35. Upgrades run node by node: each node is cordoned, drained, upgraded, and uncordoned, so the cluster stays up.

Networking and ingress

The pod network is Cilium. Ingress is Traefik, run as a host-network DaemonSet on ports 80 and 443.

There is no cloud load balancer. The Traefik Service is type: ClusterIP, not type: LoadBalancer. Traffic reaches Traefik directly on ports 80 and 443 of each node.

Control-plane high availability

The Kubernetes API is fronted by a control-plane virtual IP (VIP) shared among the healthy control-plane nodes. Each node also runs an on-node load balancer in front of the Kubernetes API, so the API stays reachable if a control-plane node goes down.

Connectivity (internal overlay vs user VPN)

Two separate network planes:

  1. Internal overlay: a private WireGuard mesh carries node-to-node cluster traffic. It is not user-reachable.
  2. User-access VPN: an opt-in tunnel (a separate wg1 interface) for operators to reach cluster-internal services. You pick one of WireGuard or NetBird; they are mutually exclusive.

The internal overlay always exists. The user VPN is something you turn on.

Cluster domains and certificates

Every cluster gets a built-in domain of the form <cid>.<aid>.<root>. Example: ky3.rjwrn.example.com.

Its wildcard TLS certificate is issued by cert-manager using a DNS-01 challenge. List and add domains:

runos cluster-domains list-by-cluster --cid ky3
runos cluster-domains add ... # custom wildcard via Cloudflare DNS-01

A custom domain provisions its own ClusterIssuer, wildcard cert, and wildcard DNS record.
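
The built-in domain is the cid, the aid, and the root joined with dots. A one-function sketch of that composition; cluster_domain is a hypothetical helper, and example.com stands in for the real root as it does in the example above:

```shell
# cluster_domain CID AID ROOT -> prints the built-in domain name.
# Hypothetical helper; ROOT is a placeholder for the platform's root domain.
cluster_domain() {
  printf '%s.%s.%s\n' "$1" "$2" "$3"
}

cluster_domain ky3 rjwrn example.com   # prints ky3.rjwrn.example.com
```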

Storage and durability

Storage is node-local by default, backed by OpenEBS (openebs-local is the default StorageClass). Node-local means a volume's data lives on one node, so it does not survive the loss of that node, and a single-node cluster has no cross-node redundancy.

For volumes that survive a node loss, add nodes and opt into LINSTOR distributed block storage (DRBD-backed, install-gated). On cloud nodes you can carve out a raw data partition for it at provision time with --provider-config.repartition.