Core concepts

This is the model you need once you are past the quickstart. Other pages link back here for the details.

Desired state, live state, reconciliation

Two states, kept apart on purpose.

  • Desired state is your intent: what you asked RunOS to run. It lives in the control plane, separate from your cluster.
  • Live state is the running Kubernetes objects on your cluster right now.

The RunOS API is the control plane. It reconciles live state toward desired state, continuously. You change desired state; RunOS makes the cluster match.
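As a rough illustration of what "reconciles live state toward desired state" means, the sketch below compares two dictionaries and derives the actions needed to make one match the other. It is a conceptual model only: plan_reconcile, reconcile_once, and the dictionary shapes are hypothetical names for this example, not RunOS internals.

```python
def plan_reconcile(desired: dict, live: dict) -> list:
    """Return (action, name) pairs that would move live state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile_once(desired: dict, live: dict) -> dict:
    """Apply one reconciliation pass and return the new live state.

    The input dictionaries are not mutated.
    """
    new_live = dict(live)
    for action, name in plan_reconcile(desired, live):
        if action == "delete":
            del new_live[name]
        else:  # create and update both copy the desired spec
            new_live[name] = desired[name]
    return new_live
```

Once a pass has been applied, planning again yields no actions. That is the steady state a continuous reconciler keeps returning to.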

Reads are fast and synchronous. Slow writes (creating, updating, or deleting a service, or running a deploy) run as background jobs and return a jobId. Watch a job with runos follow <jobId> or runos jobs show <jobId>.

"Manifest" means three things

The word is overloaded. Keep them straight.

  1. The CLI command catalog. runos manifest update downloads the latest command and endpoint definitions so your CLI knows what verbs exist. runos manifest list and runos manifest show inspect it.
  2. Your IaC file, runos.yaml. This is the manifest in the infrastructure-as-code sense: it describes one app or service. runos apps diff and runos apps sync both read it.
  3. The rendered Kubernetes manifest RunOS applies to the cluster. You normally never see it. You can patch it with files under overrides/.

Infrastructure as code (pull, diff, sync)

Three verbs round-trip config between the cluster and local files.

  • runos pull downloads a running app's config into runos.<cid>.<id>/ (the runos.yaml, .env, secret files, and overrides).
  • runos apps diff compares your local files against server state and writes nothing. Exit codes are CI-gate friendly: 0 = clean, 2 = drift you should reconcile, 1 = the diff itself errored.
  • runos apps sync pushes local files back to the cluster. It plans first (prints what would change), then applies.

So a CI pipeline can gate on runos apps diff (fail on exit 2) and apply with runos apps sync.
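A CI gate built on those exit codes might look like the following sketch. The exit codes come from the list above; the helper names (classify_diff, ci_gate) are hypothetical, and the sketch assumes runos apps diff needs no extra arguments when run from the app directory, which may not hold for your setup.

```python
import subprocess

# Exit codes documented for `runos apps diff`.
DIFF_CLEAN, DIFF_ERROR, DIFF_DRIFT = 0, 1, 2


def classify_diff(exit_code: int) -> str:
    """Map a `runos apps diff` exit code to a gate decision."""
    if exit_code == DIFF_CLEAN:
        return "clean"
    if exit_code == DIFF_DRIFT:
        return "drift"
    # 1 is the documented error code; any other code is treated as an error too.
    return "error"


def ci_gate(run=subprocess.run) -> str:
    """Run the diff and classify the result.

    `run` is injectable so the gate can be exercised without the CLI installed.
    Assumption: no extra arguments are needed in the current directory.
    """
    result = run(["runos", "apps", "diff"])
    return classify_diff(result.returncode)
```

A pipeline would fail the job when ci_gate() returns "drift" or "error", and leave runos apps sync to a separate, deliberate apply step.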

Identity (OSID, namespaces, cluster domain, wildcard cert)

Every instance has an OSID: <type>-<id>, where <id> is a 5-char suffix. Live examples: postgresql-bcv4l, cert-manager-h3xwl, traefik-i7r7v.

  • The Kubernetes namespace name equals the OSID. One namespace per instance.
  • The cluster domain is <cid>.<aid>.<root>. cid is 3 to 16 lowercase alphanumeric chars (live: ky3); aid is your account id (rjwrn); root is fixed per cluster at provision (for example example.com). So cluster ky3's domain is ky3.rjwrn.<root>.
  • An instance host is <osid>.<clusterDomain>. An exposed port gets its own auto-route, <osid>-<port>.<clusterDomain>. Example: web-3f9k2-8080.ky3.rjwrn.<root>.
  • One wildcard cert (*.<clusterDomain>) covers every host on the cluster. There is no per-service certificate.
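The naming rules above compose mechanically, so a small sketch can make them concrete. The function names and the character set assumed for <type> and <id> (lowercase letters and digits, inferred from the live examples) are assumptions for this example, not a published RunOS grammar.

```python
import re

# Assumption: <type> is lowercase words joined by hyphens, and the 5-char <id>
# is lowercase alphanumeric, as in postgresql-bcv4l and cert-manager-h3xwl.
OSID_RE = re.compile(r"^(?P<type>[a-z0-9]+(?:-[a-z0-9]+)*)-(?P<id>[a-z0-9]{5})$")
CID_RE = re.compile(r"^[a-z0-9]{3,16}$")  # 3 to 16 lowercase alphanumeric chars


def parse_osid(osid: str) -> tuple:
    """Split an OSID into (type, id). The namespace name equals the OSID itself."""
    match = OSID_RE.match(osid)
    if match is None:
        raise ValueError(f"not a valid OSID: {osid!r}")
    return match.group("type"), match.group("id")


def cluster_domain(cid: str, aid: str, root: str) -> str:
    """Build <cid>.<aid>.<root>."""
    if CID_RE.match(cid) is None:
        raise ValueError(f"not a valid cid: {cid!r}")
    return f"{cid}.{aid}.{root}"


def instance_host(osid: str, domain: str) -> str:
    """Build <osid>.<clusterDomain>."""
    return f"{osid}.{domain}"


def port_host(osid: str, port: int, domain: str) -> str:
    """Build the auto-route for an exposed port: <osid>-<port>.<clusterDomain>."""
    return f"{osid}-{port}.{domain}"
```

With root set to example.com, port_host("web-3f9k2", 8080, cluster_domain("ky3", "rjwrn", "example.com")) produces the host from the example above, and it falls under the single wildcard cert for *.ky3.rjwrn.example.com.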

Sizing (resource requirement classes)

A resource requirement class sets CPU, memory, replicas, and storage from one id. The grammar is service.tier.size. Live examples: postgresql.c0.beff, traefik.c0.beff; apps use app.sl1.beff.

  • The tier encodes replica count and storage type. c0 is one replica on node-local storage. c1 is network-replicated (distributed) storage. c2 is two replicas on local storage.
  • The size encodes resources: beff (best-effort), small, medium, large.
  • The default is best-effort (beff): CPU and memory limits only, zero requests. In Kubernetes terms that is Burstable QoS, not the BestEffort QoS class, because limits are still set. There is no autoscaling (no HPA).
  • Setting cpu, memory, or replicas yourself flips the class to custom.
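To make the service.tier.size grammar concrete, here is a minimal parsing sketch. The names (ResourceClass, parse_resource_class, TIERS) are hypothetical. The tier table holds only what the list above states: no replica count is given for c1, so it is left as None, and tiers such as sl1 are accepted but not described. How a custom class is spelled is not covered here.

```python
from dataclasses import dataclass

# Sizes named in the text.
SIZES = {"beff", "small", "medium", "large"}

# Tier meanings as listed above. The c1 replica count is not stated, so it is None.
TIERS = {
    "c0": {"replicas": 1, "storage": "node-local"},
    "c1": {"replicas": None, "storage": "network-replicated"},
    "c2": {"replicas": 2, "storage": "local"},
}


@dataclass(frozen=True)
class ResourceClass:
    service: str
    tier: str
    size: str


def parse_resource_class(class_id: str) -> ResourceClass:
    """Parse an id of the form service.tier.size."""
    parts = class_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected service.tier.size, got {class_id!r}")
    service, tier, size = parts
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}")
    return ResourceClass(service, tier, size)
```

For example, parse_resource_class("postgresql.c0.beff") yields tier c0, which TIERS maps to one replica on node-local storage.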

Config sets

A config set is account-scoped, versioned, shared advanced config you can reuse across instances (for example shared_buffers=256MB).

  • Immutable: an edit creates the next version (v1, v2, v3). setId is the stable lineage handle; isLatest marks the current version.
  • Consumed by postgresql and vllm.
  • Manage them with runos account config-sets add|list|show|update.
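The immutability rule can be pictured as an append-only history. The sketch below is an illustrative data model, not the RunOS schema: ConfigSetVersion, edit_config_set, the set id "pg-tuning", and the 512MB value are all made up for the example. Only setId, isLatest, and the v1, v2, v3 numbering come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfigSetVersion:
    set_id: str      # stable lineage handle (setId)
    version: int     # 1, 2, 3, ... shown as v1, v2, v3
    values: dict     # the advanced config, e.g. {"shared_buffers": "256MB"}
    is_latest: bool  # marks the current version (isLatest)


def edit_config_set(history: list, changes: dict) -> list:
    """Return a new history with one more version appended.

    Earlier versions keep their values; only the is_latest flag moves.
    """
    current = history[-1]
    next_version = ConfigSetVersion(
        set_id=current.set_id,
        version=current.version + 1,
        values={**current.values, **changes},
        is_latest=True,
    )
    return [replace(v, is_latest=False) for v in history] + [next_version]
```

Editing shared_buffers on a v1 set leaves v1 readable with its original value and adds v2 under the same set_id with is_latest set.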

The control loop (agents dial out, heartbeat = readiness)

Every node runs a node agent; every cluster runs a cluster agent. Each one dials out to the control plane and holds a single long-lived outbound connection open.

The control plane pushes instructions down that stream. It never polls and never connects in. That is why a node behind NAT works with no inbound ports.

Each agent sends a heartbeat on that connection. A fresh heartbeat means the node is ready. If the heartbeat goes stale (none for roughly 30 seconds) the node drops out of ready.
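The readiness rule reduces to a freshness check on the last heartbeat. A minimal sketch, assuming timestamps in seconds and a hard 30-second cutoff (the text only says "roughly 30 seconds", so the exact threshold and the is_ready name are assumptions):

```python
# Assumption: "roughly 30 seconds" is modeled as a fixed cutoff.
STALE_AFTER_SECONDS = 30.0


def is_ready(last_heartbeat, now, stale_after=STALE_AFTER_SECONDS) -> bool:
    """Return True while the most recent heartbeat is still fresh.

    `last_heartbeat` and `now` are timestamps in seconds from the same clock;
    `last_heartbeat` is None if no heartbeat has been received yet.
    """
    if last_heartbeat is None:
        return False
    return (now - last_heartbeat) <= stale_after
```

Because readiness is derived from the stream the agent itself opened, the control plane never needs to probe the node from outside.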

Security (trust tiers and credentials)

The control plane uses two trust tiers.

  • Bootstrap. A node registers with a short-lived registration token over one-way TLS, and trades it for its own mTLS client certificate.
  • Steady state. Every agent-to-control-plane stream is mutual TLS (mTLS). The node or cluster agent authenticates with its own certificate.

Three credential types you use directly.

  • Web session token. Your Console login session.
  • Personal access token (PAT). Account-scoped and expiring. Format runos_pat_<keyId>.<secret>, sent by the CLI as Authorization: Bearer runos_pat_.... The RUNOS_API_KEY env var overrides the stored key.
  • Notify API keys. Account-scoped keys for the Notify service.
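The PAT rules above (token format, Bearer header, RUNOS_API_KEY override) can be sketched as follows. The helper names are hypothetical, and two details are assumptions: that <keyId> contains no dot, and that an empty RUNOS_API_KEY counts as unset.

```python
import os

PAT_PREFIX = "runos_pat_"


def resolve_api_key(stored_key, environ=None):
    """Pick the key to send: RUNOS_API_KEY overrides the stored key.

    Assumption: an empty RUNOS_API_KEY is treated as unset.
    """
    env = os.environ if environ is None else environ
    return env.get("RUNOS_API_KEY") or stored_key


def split_pat(token: str) -> tuple:
    """Split runos_pat_<keyId>.<secret> into (keyId, secret).

    Assumption: <keyId> contains no dot, so the first dot is the separator.
    """
    if not token.startswith(PAT_PREFIX):
        raise ValueError("token does not start with runos_pat_")
    key_id, sep, secret = token[len(PAT_PREFIX):].partition(".")
    if not sep or not key_id or not secret:
        raise ValueError("expected runos_pat_<keyId>.<secret>")
    return key_id, secret


def auth_header(token: str) -> dict:
    """Build the header the CLI sends: Authorization: Bearer runos_pat_..."""
    return {"Authorization": f"Bearer {token}"}
```

Passing environ explicitly keeps the override logic testable without touching the real process environment.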