RunOS Agents

RunOS uses two types of agents to manage your Kubernetes infrastructure: the Node Agent and the Cluster Agent. These lightweight services work behind the scenes to handle server configuration, maintain connectivity, and automate operations without requiring manual intervention.

Node Agent

The Node Agent is a daemon that runs on each server in your cluster. It acts as the bridge between the RunOS Console and your infrastructure, handling everything from initial server setup to ongoing operations.

What It Does

Server Preparation

When you add a server to RunOS, the Node Agent automatically:

Installs and configures Kubernetes components
Sets up the container runtime (containerd)
Configures networking and DNS resolution
Establishes VPN connectivity to other cluster nodes
Applies security configurations and kernel parameters

This transforms a basic Ubuntu server into a fully configured Kubernetes node without any manual steps.

Ongoing Operations

The Node Agent maintains your server's health and connectivity:

Executes deployment commands from the Console
Sends health status and metrics to the platform
Manages VPN connections between servers
Synchronizes configuration across the cluster
Reports resource availability and pod status

Secure Communication

All communication uses certificate-based authentication with mutual TLS (mTLS) encryption:

Registers with the platform using a short-lived token
Receives dedicated mTLS certificates during registration
Maintains persistent encrypted connection to RunOS backend
Sends heartbeat messages every 5 seconds to prove health
Automatically reconnects if the connection drops

VPN Management

The Node Agent manages two WireGuard VPN interfaces that connect your servers in a secure mesh:

wg0 - Kubernetes Internal Network (172.24.0.0/16)

Connects all cluster nodes for pod-to-pod communication
Encrypts all inter-node Kubernetes traffic
Enables services to communicate across servers
Uses full mesh topology (every node connects to every other node)

wg1 - User Access Network (172.24.200.0/21)

Provides secure remote access for administrators
Allows authorized users to access cluster services
Uses hub-and-spoke topology (users connect to nodes, not each other)

The agent automatically synchronizes VPN peer configurations when servers join or leave your cluster, ensuring connectivity is always maintained without manual VPN management.
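To make the two network ranges and the full mesh topology concrete, here is a small arithmetic sketch. The function names are illustrative and not part of the RunOS tooling; the only inputs taken from this page are the two prefix lengths.

```shell
#!/bin/sh
# Tunnels in a full mesh of n nodes: every node peers with the other n-1,
# and each tunnel is shared by two nodes, so the total is n*(n-1)/2.
mesh_tunnels() {
    n=$1
    echo $(( n * (n - 1) / 2 ))
}

# Number of IPv4 addresses in a network with the given prefix length.
cidr_size() {
    prefix=$1
    echo $(( 1 << (32 - prefix) ))
}

mesh_tunnels 5   # prints 10    (5 nodes, 4 peers each)
cidr_size 16     # prints 65536 (wg0 range, 172.24.0.0/16)
cidr_size 21     # prints 2048  (wg1 range, 172.24.200.0/21)
```

Because the tunnel count grows quadratically with the node count, automatic peer synchronization matters more as the cluster grows.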

How It Connects

Registration (One-Time)

You provide a registration token from the Console
Agent contacts RunOS via HTTPS/TLS
Token is validated against your account
Agent receives mTLS certificates for secure communication
Configuration saved to /etc/runos/config.yaml

Operational Connection (Always-On)

Maintains persistent mTLS connection to RunOS backend
All connections are outbound from your server (no inbound ports needed)
Sends heartbeat messages every 5 seconds
Automatically reconnects with exponential backoff if connection drops
If heartbeats stop, server appears offline in Console within 30 seconds
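The reconnect and offline behavior described above can be sketched as follows. This is an illustration of the general technique, not the agent's actual code: the base delay and the cap are assumed values that this page does not specify, and the offline check simply applies the 30-second window (about six missed 5-second heartbeats).

```shell
#!/bin/sh
# Assumed values for illustration; the agent's real settings are not documented here.
BASE_SECONDS=1
CAP_SECONDS=60

# Delay before reconnect attempt N (0-based): BASE_SECONDS * 2^N, capped.
backoff_delay() {
    attempt=$1
    # Clamp the exponent so the shift cannot overflow.
    if [ "$attempt" -gt 16 ]; then attempt=16; fi
    delay=$(( BASE_SECONDS << attempt ))
    if [ "$delay" -gt "$CAP_SECONDS" ]; then delay=$CAP_SECONDS; fi
    echo "$delay"
}

# Window from this page: a server shows as offline within 30 seconds.
OFFLINE_AFTER=30

# Succeeds when no heartbeat has been seen for OFFLINE_AFTER seconds or more.
is_offline() {
    last_heartbeat=$1
    now=$2
    [ $(( now - last_heartbeat )) -ge "$OFFLINE_AFTER" ]
}

# Delays double on each attempt until they reach the cap.
for n in 0 1 2 3 4 5 6 7; do
    printf 'attempt %d: wait %ds\n' "$n" "$(backoff_delay "$n")"
done
```

Doubling the delay keeps a node that has lost connectivity from retrying in a tight loop, while the cap bounds how long it waits once the network returns.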

Resource Footprint

The Node Agent is designed to be efficient:

CPU: Typically less than 1% at idle, with brief spikes during operations
Memory: ~50-100 MB RAM
Network: Less than 1 MB/hour for heartbeats and status updates
Disk: Log rotation keeps disk usage bounded

The overhead is negligible compared to your application workloads.

Monitoring the Agent

Check agent status:

# View agent status
./nodeagent status

# Check systemd service
sudo systemctl status nodeagent

# View recent logs
sudo tail -f /var/log/runos/nodeagent.log

# Verify VPN connectivity
sudo wg show

Healthy indicators:

Agent status shows "Running"
Recent heartbeat within last 5 seconds
VPN interfaces up with peers connected
No repeated reconnection attempts in logs
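The "peers connected" indicator can be checked from a script. The sketch below filters the output of wg show <interface> latest-handshakes, which prints one public key and one Unix timestamp per peer. The function name and the 180-second threshold are assumptions for illustration; an idle peer without keepalives can legitimately show an older handshake.

```shell
#!/bin/sh
# Reads "public-key<TAB>unix-timestamp" lines on stdin and prints the keys of
# peers whose last handshake is older than max_age seconds, or that have
# never completed a handshake (timestamp 0).
stale_peers() {
    now=$1
    max_age=$2
    awk -v now="$now" -v max="$max_age" \
        '$2 == 0 || now - $2 > max { print $1 }'
}

# Example usage on a node (180 is an assumed threshold):
#   sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```

An empty result means every peer on that interface has completed a handshake within the threshold.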

Node Agent Commands

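The Node Agent provides several commands for managing your server. Commands that change the system require root privileges and are shown with sudo below; read-only commands such as status, logs, and version are shown without it.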

Common Commands:

# Register node with RunOS (one-time setup)
sudo ./nodeagent register --token <TOKEN> --aid <ACCOUNT_ID> --control-plane <0|1>

# Start the agent daemon (usually run via systemd)
sudo ./nodeagent agent

# Check system readiness before installation
sudo ./nodeagent preflight

# Install Kubernetes and join cluster (automated during setup)
sudo ./nodeagent install

# Synchronize VPN peers manually
sudo ./nodeagent sync vpn

# View agent status
./nodeagent status

# View formatted logs
./nodeagent logs

# Renew mTLS certificate
sudo ./nodeagent certificate renew

# Show version information
./nodeagent version

Registration Example:

When adding a server, you'll use the registration command with the token from the Console:

# Register as a worker node
sudo ./nodeagent register --token abc123xyz789 --aid acct_5678 --control-plane 0

# Register as a control plane node
sudo ./nodeagent register --token abc123xyz789 --aid acct_5678 --control-plane 1

Common Operations:

Most day-to-day operations don't require manual commands since the agent runs as a systemd service. However, you might use:

sync vpn - If VPN connectivity is lost and needs manual resync
status - To check agent health and connection status
logs - To view formatted agent logs for troubleshooting

Configuration File:

After registration, configuration is stored in /etc/runos/config.yaml. This includes:

Node identifier (nid)
Account identifier (aid)
Server endpoints
Network settings

You should not need to modify this file manually.

Cluster Agent

The Cluster Agent runs inside your Kubernetes cluster (in the runos namespace) and handles cluster-level operations, primarily automatic SSL/TLS certificate management.

What It Does

Automatic SSL Certificate Management

The Cluster Agent's primary responsibility is managing the cluster domain certificate.

One-Time Certificate Request:

When your cluster is first created, the Cluster Agent requests a wildcard SSL certificate from Let's Encrypt for your cluster's subdomain (e.g., *.your-cluster.runos.io). This happens once during cluster initialization:

cert-manager requests a wildcard certificate from Let's Encrypt
Let's Encrypt sends a DNS-01 challenge to verify domain ownership
Cluster Agent receives the challenge and coordinates with RunOS to update DNS
Let's Encrypt verifies the DNS record and issues the wildcard certificate
Certificate is stored in Kubernetes as a secret
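For the DNS-01 step above, the ACME protocol has the certificate authority look up a TXT record at _acme-challenge.<domain>; for a wildcard name, the leading "*." is dropped first. A minimal sketch of that naming rule, with a hypothetical helper name:

```shell
#!/bin/sh
# Returns the DNS name where the DNS-01 TXT record is expected for a
# requested certificate name. A leading "*." (wildcard) is stripped first.
acme_challenge_name() {
    domain=${1#\*.}
    echo "_acme-challenge.${domain}"
}

acme_challenge_name '*.your-cluster.runos.io'
# prints _acme-challenge.your-cluster.runos.io
```

While a challenge is pending, a query such as dig +short TXT "$(acme_challenge_name '*.your-cluster.runos.io')" should return the challenge value once the record has propagated.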

Certificate Usage:

Once obtained, the wildcard certificate:

Covers the direct subdomains of your cluster domain (for example, app.your-cluster.runos.io)
Is stored securely in the cluster
Gets copied to services and applications that need SSL/TLS
Is automatically renewed 30 days before expiration
Requires no manual management
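The 30-day renewal window amounts to a simple comparison on the expiry time. The sketch below shows that arithmetic only; the function name is illustrative, and how the agent and cert-manager actually schedule renewal is not described on this page.

```shell
#!/bin/sh
RENEW_BEFORE_DAYS=30   # renewal window stated above

# Succeeds (exit 0) when a certificate expiring at not_after (Unix time)
# is inside the renewal window, or already expired, as of now (Unix time).
renewal_due() {
    not_after=$1
    now=$2
    remaining=$(( not_after - now ))
    [ "$remaining" -le $(( RENEW_BEFORE_DAYS * 86400 )) ]
}

# Example: a certificate that expires 45 days from now is not yet due.
now=$(date +%s)
if renewal_due $(( now + 45 * 86400 )) "$now"; then
    echo "renewal due"
else
    echo "not due yet"
fi
```

For a certificate file on disk, openssl x509 -checkend 2592000 -noout -in <file> makes a comparable check: it exits non-zero if the certificate expires within that many seconds.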

What This Means:

Every service and application you deploy automatically gets SSL/TLS because it can use the stored wildcard certificate. You never need to request individual certificates: deploy as many services as you want and they all get HTTPS immediately.

Benefits:

One certificate covers unlimited applications
Instant SSL for all new deployments
Automatic renewal handled by the agent
No certificate management overhead

Secure Cluster Operations

The agent performs authorized operations within your cluster:

Managing Kubernetes secrets for platform integration
Viewing pod status and deployment information
Coordinating with cert-manager for certificate operations
Maintaining persistent connection to RunOS backend

How It Connects

Initial Bootstrap

When first deployed, the agent temporarily uses the node's certificate for authentication, then receives its own dedicated mTLS certificate from RunOS backend. After that, it operates with its own credentials.

Persistent Connection

Maintains always-on bidirectional connection to RunOS backend
All traffic uses mTLS encryption
Automatically reconnects if connection drops
Regular heartbeat messages confirm connection health

This persistent connection allows real-time cluster health monitoring and coordinated certificate renewals.

Security and Permissions

What the Agent Can Do:

Read and write secrets in the runos namespace
View pod information across the cluster
Interact with cert-manager for certificates
Manage certificate-related Kubernetes resources

What the Agent Cannot Do:

Modify resources outside runos namespace without authorization
Access your application data or user information
Make infrastructure changes without explicit authorization

All operations are authenticated via mTLS and can be traced through Kubernetes audit logs.
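You can spot-check these permissions yourself with kubectl auth can-i and service account impersonation. In the sketch below, the service account name runos-cluster-agent is an assumption borrowed from the deployment name used elsewhere on this page; list the real one with kubectl get serviceaccounts -n runos.

```shell
#!/bin/sh
# Hypothetical service account name; override it if yours differs.
SERVICE_ACCOUNT=${SERVICE_ACCOUNT:-runos-cluster-agent}

# Asks the API server whether the agent's service account may perform
# VERB on RESOURCE in NAMESPACE. kubectl prints "yes" or "no".
agent_can() {
    verb=$1
    resource=$2
    namespace=$3
    kubectl auth can-i "$verb" "$resource" -n "$namespace" \
        --as="system:serviceaccount:runos:${SERVICE_ACCOUNT}"
}

# Example usage:
#   agent_can get secrets runos      # expected to be allowed, per the list above
#   agent_can delete pods default    # expected to be denied, per the list above
```

The account running kubectl needs permission to impersonate service accounts, which cluster administrators normally have.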

Resource Footprint

The Cluster Agent is lightweight:

CPU: ~100m (limit: 200m)
Memory: ~128MB (limit: 256MB)
Network: Minimal bandwidth for heartbeats and operations
Single Instance: Only one agent pod runs per cluster

Monitoring the Agent

Check agent status:

# View agent pod
kubectl get pods -n runos

# Check agent logs
kubectl logs -n runos deployment/runos-cluster-agent

# Verify health
kubectl describe pod -n runos -l app=runos-cluster-agent

Normal operation shows:

Successful connection to RunOS backend
Regular heartbeat responses
Certificate challenge handling (during issuance/renewal)

How Agents Work Together

When you deploy an application through RunOS:

You trigger deployment in the Console
Console sends command to RunOS backend
Backend routes to Node Agent on the appropriate server
Node Agent executes Kubernetes deployment commands
Kubernetes starts your application in containers
The wildcard SSL certificate managed by the Cluster Agent secures your app's domain
Status flows back: Node Agent → Backend → Console
You see the deployment succeed with secure HTTPS endpoint

This entire flow happens in seconds, with both agents coordinating automatically to deliver a fully configured, secure deployment.

Agent Communication Security

Both agents use multiple layers of security:

Certificate-Based Authentication:

Agents present client certificates for all operations
Server validates certificates before accepting commands
Certificates can be revoked if compromised

Encrypted Communication:

All traffic uses TLS/mTLS encryption
VPN traffic uses WireGuard encryption
No plaintext credentials in configuration

Minimal Attack Surface:

No inbound network ports opened
All connections initiated outbound
No remote shell access provided
Limited to authorized operations only

Best Practices

For Node Agents:

Run as systemd service for automatic restart
Monitor logs periodically for warnings
Keep certificates valid and check expiration dates
Maintain network connectivity to RunOS platform
Don't modify configuration files manually unless necessary

For Cluster Agents:

Don't modify agent deployment or service account permissions
Monitor logs to ensure normal operation
Maintain network connectivity to RunOS backend
Keep cert-manager installed and properly configured
Report unusual behavior to RunOS support

Troubleshooting Agents

Node Agent Issues:

Server appears offline in Console
Deployment commands not received
VPN connectivity lost

Quick checks:

./nodeagent status
sudo systemctl status nodeagent
sudo wg show
sudo ./nodeagent sync vpn

Cluster Agent Issues:

Certificates not being issued or renewed
Agent pod not starting

Quick checks:

kubectl get pods -n runos
kubectl logs -n runos deployment/runos-cluster-agent
kubectl get certificaterequest -A

For detailed troubleshooting, contact RunOS support with relevant logs and error messages.

Cluster Autonomy and Agent Dependencies

Your Cluster is Independent

Important: Your Kubernetes cluster does NOT depend on RunOS or its agents to continue running. The cluster is fully autonomous once configured.

What happens if RunOS backend goes down:

Your cluster continues running normally
All applications and services remain operational
Kubernetes manages workloads independently
Existing VPN connections remain active
No impact on cluster stability or performance

What happens if agents are stopped:

Cluster continues operating without interruption
Applications keep running as deployed
Kubernetes continues managing pods and services
You can still use kubectl and other tools directly

The agents are operational tools, not critical dependencies. They facilitate management but aren't required for cluster operation.

Agent Responsibilities

The agents only handle specific operational tasks:

Node Agent (runos binary) handles:

Executing commands sent from the Console (user-initiated actions)
Keeping control plane nodes synchronized in the cluster mesh
Updating VPN peer configurations (wg0) when nodes join/leave
Sending heartbeat and health metrics to the Console

Cluster Agent handles:

Automatic SSL certificate renewal coordination
Cluster-level operations triggered from the Console
Maintaining cluster metadata in the RunOS platform

What continues without agents:

All running applications and services
Kubernetes cluster management and orchestration
Pod scheduling and autoscaling
Service networking and load balancing
Storage management and volumes
Existing VPN connections (wg0 and wg1)

Removing RunOS Agents

You can completely remove RunOS agents and even delete your RunOS account after cluster setup. Your cluster will continue functioning independently.

Safe removal process:

If you decide to remove the agents, stop them in close succession across all nodes:

# Stop agents on all nodes within a short timeframe
sudo systemctl stop runos

Why close succession matters:

The Node Agent manages control plane node discovery and VPN peer synchronization (wg0). If you stop agents gradually:

A control plane node might miss heartbeats
Other nodes may see it as offline
The node could be removed from dnsmasq configuration
This creates temporary inconsistencies in cluster DNS

Stopping all agents within a few minutes prevents these race conditions.
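One way to stop the agents at nearly the same time is to run the stop command on every node in parallel over SSH. This is a sketch under assumptions: the host names are hypothetical, and it presumes SSH access and non-interactive sudo on each node.

```shell
#!/bin/sh
# Runs the stop command on every host at roughly the same time by starting
# each ssh session in the background, then waits for all of them to finish.
stop_agents() {
    for host in "$@"; do
        ssh "$host" sudo systemctl stop runos &
    done
    wait
}

# Example (hypothetical host names):
#   stop_agents node-1 node-2 node-3
```

The function does not report which hosts failed, so confirm the result on each node afterwards with sudo systemctl status runos.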

After removing agents:

Cluster operates independently
You manage all operations via kubectl
No Console connectivity or automation
Manual VPN peer management if nodes are added/removed
Manual SSL certificate management
No centralized monitoring or logging integration

When You Need the Agents

The agents are valuable for:

Automated operations: Deploy services via Console without kubectl
Cluster expansion: Add nodes seamlessly with automatic configuration
VPN management: Automatic peer synchronization as cluster changes (wg0)
SSL automation: Wildcard certificate provisioning and renewal
Monitoring: Centralized visibility into cluster health
Remote access: Secure access without complex firewall rules

Reconnecting Agents

If you stop agents temporarily and want to reconnect:

# Restart the node agent
sudo systemctl start runos

# Verify connection
./runos status

The agent will automatically:

Reconnect to RunOS backend
Resume sending heartbeats
Sync VPN peer configurations
Report current cluster state

No reconfiguration is needed: the agent uses stored credentials from /etc/runos/config.yaml.

What You Should Know

OSID (Open Service Identifier)

Every service and application deployed in RunOS has a unique identifier called an OSID, in the format service-name-xxxxx (e.g., mysql-d6ekr).

Each OSID corresponds to a Kubernetes namespace where all related resources are deployed. When troubleshooting at the Kubernetes level, find the OSID in the RunOS Console and use it as the namespace:

# View all resources for a service
kubectl get all -n mysql-d6ekr

# Check pod logs
kubectl logs -n myapp-t7r4s <pod-name>

This organization keeps each service isolated and makes troubleshooting straightforward.
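When scripting against OSIDs, it can help to split one into its service name and suffix. The helpers below are a sketch with illustrative names; the five-character lowercase alphanumeric suffix is inferred from the examples above and is not a documented rule.

```shell
#!/bin/sh
# Service name: everything before the last hyphen.
osid_service() { echo "${1%-*}"; }

# Suffix: everything after the last hyphen.
osid_suffix() { echo "${1##*-}"; }

# Succeeds if the argument looks like <service-name>-<5 lowercase alphanumerics>.
# The suffix length is an assumption based on the examples on this page.
is_osid() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*-[a-z0-9]{5}$'
}

osid_service mysql-d6ekr   # prints mysql
osid_suffix mysql-d6ekr    # prints d6ekr
```

Since the OSID is also the namespace, a command such as kubectl get all -n "$osid" needs no further lookup.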
