Operations

This page covers running RunOS day to day: tracking async work, reading logs and metrics, debugging a bad deploy, alerting, healing certificates, operating managed services, and managing accounts and credentials.

Jobs (the async model)

Every slow operation (create a service, deploy an app, add a node) returns a jobId and runs in the background. You track it; you do not wait on the request.

runos jobs list --cid ky3       # recent jobs
runos jobs show <jobId>         # status and details
runos follow <jobId>            # block until it finishes

A job moves through pending, then running, then one terminal state: completed, failed, cancelled, or timed_out.

timed_out is record-only. RunOS stops tracking a job that has run past a timeout with no progress, but it does not kill the underlying Kubernetes or Helm work, which keeps reconciling. Do not treat timed_out as failed: re-check the cluster and re-run the operation idempotently rather than assuming nothing happened.
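The state handling described above can be sketched in Python. This is a minimal illustration, not RunOS code: the state names come from the text, while is_terminal, next_action, and the action labels are hypothetical names chosen for the example.

```python
# Job states as listed in the text; helper names and action labels are
# illustrative assumptions, not part of the RunOS API.
TERMINAL_STATES = {"completed", "failed", "cancelled", "timed_out"}
ACTIVE_STATES = {"pending", "running"}


def is_terminal(state: str) -> bool:
    """Return True once a job has reached one of its terminal states."""
    return state in TERMINAL_STATES


def next_action(state: str) -> str:
    """Map a job state to a suggested operator action."""
    if state in ACTIVE_STATES:
        return "keep-following"
    if state == "completed":
        return "done"
    if state == "timed_out":
        # Tracking stopped, but the cluster work may still be reconciling.
        return "recheck-cluster-then-rerun"
    if state in {"failed", "cancelled"}:
        return "inspect-job"
    raise ValueError(f"unknown job state: {state!r}")
```

The point of the separate timed_out branch is that it leads to a re-check, not to the same path as failed.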

runos jobs cancel <jobId> is advisory. The current work item finishes and later items are skipped, but side effects already applied to the cluster stay. The response sets partialState: true when a job was cancelled mid-run, so inspect runos jobs show and decide whether to roll back.
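A script that cancels jobs could check the flag like this. It is a sketch that assumes the cancel response has already been parsed into a dict; only the partialState field comes from the text, and the function name is hypothetical.

```python
def needs_rollback_review(cancel_response: dict) -> bool:
    """Return True when the response reports a job cancelled mid-run.

    Only an explicit partialState of True counts; a missing field is
    treated as "no partial state reported".
    """
    return cancel_response.get("partialState") is True
```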

Per-step detail lives under runos jobs workitems <jobId> and runos jobs workitem-logs.

Logs

Live container logs are always available. They are read straight from the kubelet, for every service type and app, with no extra install.

runos apps logs <id> --tail 200 --since 600 --follow
runos services postgresql logs <id> --previous

Flags: --tail (lines from the end, default 100), --since (last N seconds), --previous (the previous container instance, useful after a crash), --follow (stream).
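When wrapping the CLI in a script, the flags above can be assembled programmatically. The sketch below only builds an argument list and does not call runos; the flag names and the default of 100 come from the text, while build_logs_args and its parameter names are assumptions made for the example.

```python
from typing import List, Optional


def build_logs_args(
    tail: int = 100,
    since_seconds: Optional[int] = None,
    previous: bool = False,
    follow: bool = False,
) -> List[str]:
    """Assemble the documented log flags into an argument list."""
    if tail < 0:
        raise ValueError("tail must not be negative")
    args = ["--tail", str(tail)]
    if since_seconds is not None:
        if since_seconds < 0:
            raise ValueError("since_seconds must not be negative")
        args += ["--since", str(since_seconds)]
    if previous:
        args.append("--previous")
    if follow:
        args.append("--follow")
    return args
```

For example, ["runos", "apps", "logs", app_id] + build_logs_args(tail=200, since_seconds=600, follow=True) reproduces the first command shown above, with app_id standing in for the app's id.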

Historical log search is opt-in. It needs a system Vector DaemonSet (one per cluster). Vector auto-provisions a ClickHouse instance, or reuses one you pass. Retention defaults to 7 days and is set at create time (1-90 days); logs older than the window age out. On a fresh cluster only live logs work until Vector is added:

runos services vector add

Once installed, runos services vector search-logs filters by namespace, pod, container, node, and substring (--search), with an optional time window (--since, or --start/--end).

Metrics

metrics-server is always on. It is installed when the cluster is configured and gives point-in-time CPU and memory, the same data as kubectl top.

Historical metrics are opt-in, one command away with the system Prometheus. It provides time-series resource metrics, PromQL queries, and metric-name discovery (labels). There is one system Prometheus per cluster:

runos services prometheus add
runos services postgresql resource-metrics <id>   # CPU/mem over time

Debugging a failed deploy

Work in order. Stop when you have the answer.

1. Status first. runos apps status <id> and runos jobs show <jobId>.
2. Runtime logs. runos apps logs <id> for a crash after start (missing env var, exception, OOMKilled).
3. Build logs. runos apps builds <id> for a CLI deploy, or runos apps github-builds / runos apps gitlab-builds for a VCS deploy that failed to build or push.
4. Historical logs. runos services vector search-logs (needs Vector) for a cross-service or time-bounded pattern.
5. Metrics. resource-metrics for memory pressure or OOM.

Note: cluster-reachability errors (agent not connected, in-cluster DNS, connection refused) surface as 503 "retry" and are not real failures. Retry them.
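Automation can absorb these transient 503s with a small retry loop. The following is a sketch under stated assumptions: request is any caller-supplied function returning a (status, body) pair, and the attempt count and exponential backoff are illustrative choices, not documented RunOS behaviour.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    request: Callable[[], Tuple[int, str]],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` until it stops returning 503 or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status, body = request()
    for attempt in range(1, attempts):
        if status != 503:
            break
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = request()
    return status, body
```

Any status other than 503 is returned immediately, so real failures still surface on the first response. If every attempt returns 503, the last 503 is returned for the caller to handle.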

Alerting

Alertmanager is opt-in on the system Prometheus, one command away. Enable it, then choose a rule sensitivity profile:

runos services prometheus install-alertmanager
runos services prometheus set-alert-profile   # essential | standard | aggressive

The three profiles (essential, standard, and aggressive) set the baseline thresholds; you can pin individual rules on top. Configure receivers (webhook, email, slack), routing, and silences with set-receiver, set-routing, and create-silence. List what is firing with runos services prometheus alertmanager-alerts.

Certificates and ACME

TLS terminates at Traefik on port 443 (websecure), running as a host-network DaemonSet on each node. cert-manager issues the certificates (one instance per cluster). The cluster wildcard certificate uses a DNS-01 challenge.

When the cluster-domain issuer hits the rare cert-manager "No Key ID in JWS header" account mixup (ACME Orders fail to sign), heal it:

runos clusters acme-heal

It restarts cert-manager and deletes the failed Order so a fresh one is signed. It is idempotent and no-ops when the cert is already Ready, so it is safe to run anytime, including at renewal. Add --scope cluster-domain --cluster-domain-id <id> for a custom-domain wildcard.

Managed-service operations

The storage backend is immutable after create. It is chosen when you provision the service; changing it means recreating the service. openebs-local is the default (node-local, no replication). LINSTOR is the opt-in distributed backend (DRBD-replicated, survives node loss, online resize).

Backups are opt-in, not automatic. No backend backs itself up on a default schedule. For PostgreSQL, give it an S3-style destination and run configure-backup to wire up scheduled backups, a retention window, one-shot backups, and restore. Valkey is always ephemeral (emptyDir, no PVC), so it holds no durable state to back up.

Every managed service exposes a common operator surface (logs, status, show, update, and resource-metrics) plus type-specific verbs; PostgreSQL, for example, adds replication-status, promote, and relocate. PostgreSQL, MySQL, Valkey, MinIO, Harbor, Kafka, RabbitMQ, ClickHouse, and the AI services are all fully managed.

Accounts and credentials

Roles are admin and limited. New members default to limited (everything except key and user management). Manage members and invites:

runos account users list
runos account invite --email teammate@example.com --account-role limited
runos account invites list

Invites expire after 24 hours, so revoke a wrong address before then. Account, role, and invite management requires a human login session; PATs are rejected on those routes.

Personal access tokens (PATs) are for automation against the API. The format is runos_pat_<keyId>.<secret>, sent in an Authorization: Bearer header. A PAT is account-scoped (one account, all clusters), expiring (expiresAt is required, must be in the future, and can be at most 365 days out), and shown once: the server stores only a hash, so a lost token is recreated, not recovered.

runos account api-keys add      # token printed once
runos account api-keys list
runos account api-keys revoke <id>

The CLI reads a PAT from the RUNOS_API_KEY env var. Use a limited key for CI.
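For scripts that call the API directly rather than through the CLI, the PAT rules above can be checked client-side. This is a sketch with labelled assumptions: the token format, the RUNOS_API_KEY variable, and the 365-day limit come from the text; the helper names are hypothetical, and the pattern assumes the keyId contains no dot, which the text does not state.

```python
import os
import re
from datetime import datetime, timedelta, timezone
from typing import Dict, Mapping, Tuple

# Assumption: keyId has no "." in it; everything after the first dot is the secret.
PAT_PATTERN = re.compile(r"^runos_pat_([^.]+)\.(.+)$")
MAX_LIFETIME = timedelta(days=365)


def split_pat(token: str) -> Tuple[str, str]:
    """Split a PAT into (keyId, secret), or raise ValueError if malformed."""
    match = PAT_PATTERN.match(token)
    if match is None:
        # Do not echo the token: it is a credential.
        raise ValueError("token does not match runos_pat_<keyId>.<secret>")
    return match.group(1), match.group(2)


def auth_header(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the Bearer header from the RUNOS_API_KEY environment variable."""
    token = env["RUNOS_API_KEY"]
    split_pat(token)  # fail early on a malformed token
    return {"Authorization": f"Bearer {token}"}


def expiry_is_valid(expires_at: datetime, now: datetime) -> bool:
    """Check that expiresAt is in the future and at most 365 days out."""
    return now < expires_at <= now + MAX_LIFETIME
```

The server remains the authority on what it accepts; these checks only catch obvious mistakes before a request is sent.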

Notify API keys authenticate email sends through the Notify service. They are passed as X-API-Key (not Bearer), are account-scoped, and cannot read or change any other resource.

runos account notify-keys add   # keyValue printed once
runos account notify-keys list
runos account notify-keys delete <id>

A send returns 202 when the email is queued, not delivered: there is no bounce or delivery signal. Notify keys have no expiry and are hard-deleted (not revoked), so rotate by delete-and-recreate.
