RunOS Agents
RunOS uses two types of agents to manage your Kubernetes infrastructure: the Node Agent and the Cluster Agent. These lightweight services work behind the scenes to handle server configuration, maintain connectivity, and automate operations without requiring manual intervention.
Node Agent
The Node Agent is a daemon that runs on each server in your cluster. It acts as the bridge between the RunOS Console and your infrastructure, handling everything from initial server setup to ongoing operations.
What It Does
Server Preparation
When you add a server to RunOS, the Node Agent automatically:
- Installs and configures Kubernetes components
- Sets up the container runtime (containerd)
- Configures networking and DNS resolution
- Establishes VPN connectivity to other cluster nodes
- Applies security configurations and kernel parameters
This transforms a basic Ubuntu server into a fully configured Kubernetes node without any manual steps.
Ongoing Operations
The Node Agent maintains your server's health and connectivity:
- Executes deployment commands from the Console
- Sends health status and metrics to the platform
- Manages VPN connections between servers
- Synchronizes configuration across the cluster
- Reports resource availability and pod status
Secure Communication
Registration uses a short-lived token over HTTPS; after that, all communication uses certificate-based authentication with mutual TLS (mTLS) encryption:
- Registers with the platform using a short-lived token
- Receives dedicated mTLS certificates during registration
- Maintains a persistent encrypted connection to the RunOS backend
- Sends heartbeat messages every 5 seconds to signal that it is healthy
- Automatically reconnects if the connection drops
VPN Management
The Node Agent manages two WireGuard VPN interfaces that connect your servers in a secure mesh:
wg0 - Kubernetes Internal Network (172.24.0.0/16)
- Connects all cluster nodes for pod-to-pod communication
- Encrypts all inter-node Kubernetes traffic
- Enables services to communicate across servers
- Uses full mesh topology (every node connects to every other node)
wg1 - User Access Network (172.24.200.0/21)
- Provides secure remote access for administrators
- Allows authorized users to access cluster services
- Uses hub-and-spoke topology (users connect to nodes, not each other)
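For a sense of scale, the helper below prints the first and last address of a CIDR block using only shell arithmetic. It is a generic sketch rather than part of the RunOS tooling, and the function names (ip_to_int, int_to_ip, cidr_range, cidr_size) are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  local IFS=.          # split the address on dots
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Convert an integer back to dotted-quad notation.
int_to_ip() {
  local n=$1
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# Print "first last" for a block such as 172.24.0.0/16.
cidr_range() {
  local addr=${1%/*} bits=${1#*/}
  local ip size first
  ip=$(ip_to_int "$addr")
  size=$(( 1 << (32 - bits) ))     # number of addresses in the block
  first=$(( ip & ~(size - 1) ))    # clear the host bits
  echo "$(int_to_ip "$first") $(int_to_ip $(( first + size - 1 )))"
}

# Number of addresses in a block.
cidr_size() {
  echo $(( 1 << (32 - ${1#*/}) ))
}
```

Running cidr_range 172.24.0.0/16 prints 172.24.0.0 172.24.255.255 (65,536 addresses), and cidr_range 172.24.200.0/21 prints 172.24.200.0 172.24.207.255 (2,048 addresses).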
The agent automatically synchronizes VPN peer configurations when servers join or leave your cluster, ensuring connectivity is always maintained without manual VPN management.
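The arithmetic of a full mesh shows why peer synchronization is worth automating: each of N nodes holds N-1 peers, and every join touches every existing node. This is a generic sketch of full-mesh bookkeeping with hypothetical function names, not RunOS code.

```shell
# Peers each node must hold in a full mesh of N nodes.
mesh_peers_per_node() { echo $(( $1 - 1 )); }

# Total point-to-point tunnels in a full mesh of N nodes: N*(N-1)/2.
mesh_tunnels() { echo $(( $1 * ($1 - 1) / 2 )); }

# Tunnels added when one node joins a cluster of N nodes
# (one new peer entry on each of the N existing nodes).
tunnels_added_on_join() {
  echo $(( $(mesh_tunnels $(( $1 + 1 ))) - $(mesh_tunnels "$1") ))
}
```

For a 5-node cluster, each node carries 4 wg0 peers and the mesh has 10 tunnels; adding a sixth node adds 5 tunnels, one on each existing node.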
How It Connects
Registration (One-Time)
- You provide a registration token from the Console
- Agent contacts RunOS via HTTPS/TLS
- Token is validated against your account
- Agent receives mTLS certificates for secure communication
- Configuration is saved to /etc/runos/config.yaml
Operational Connection (Always-On)
- Maintains persistent mTLS connection to RunOS backend
- All connections are outbound from your server (no inbound ports needed)
- Sends heartbeat messages every 5 seconds
- Automatically reconnects with exponential backoff if connection drops
- If heartbeats stop, server appears offline in Console within 30 seconds
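The exact backoff parameters are not documented here, so the sketch below uses assumed values (a 1-second base delay doubling up to a 60-second cap) purely to illustrate how an exponential reconnect schedule grows. BASE, CAP, and the function names are hypothetical.

```shell
# Illustrative exponential backoff for reconnect attempts.
# BASE and CAP are assumed values, not documented RunOS settings.
BASE=1   # seconds to wait before the first retry
CAP=60   # no single wait exceeds this many seconds

# Delay before retry number $1 (0 = first retry): BASE * 2^attempt, capped at CAP.
backoff_delay() {
  local attempt=$1 delay=$BASE i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$CAP" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$CAP" ]; then
    delay=$CAP
  fi
  echo "$delay"
}

# Space-separated list of the first $1 delays.
backoff_schedule() {
  local n=$1 i=0 out=""
  while [ "$i" -lt "$n" ]; do
    out="$out $(backoff_delay "$i")"
    i=$(( i + 1 ))
  done
  echo "${out# }"
}
```

With these assumed values, backoff_schedule 8 prints 1 2 4 8 16 32 60 60. Real clients commonly add random jitter so that many nodes do not retry in lockstep; that is left out here for clarity.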
Resource Footprint
The Node Agent is designed to be efficient:
- CPU: Minimal, typically less than 1% when idle, with brief spikes during operations
- Memory: ~50-100MB RAM
- Network: less than 1MB/hour for heartbeats and status updates
- Disk: Log rotation keeps disk usage under control
The overhead is negligible compared to your application workloads.
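A quick worked check of the network figure: at one heartbeat every 5 seconds, the agent sends 720 heartbeats per hour, so staying under 1 MB per hour (taken here as 1,000,000 bytes) leaves about 1.4 KB per heartbeat on average, before counting status updates. The variable and function names are illustrative.

```shell
INTERVAL=5        # seconds between heartbeats (from this page)
BUDGET=1000000    # "less than 1MB/hour", read here as 1,000,000 bytes

# Heartbeats sent in one hour.
heartbeats_per_hour() { echo $(( 3600 / INTERVAL )); }

# Average bytes per heartbeat that would still fit inside the hourly budget.
bytes_per_heartbeat() { echo $(( BUDGET / $(heartbeats_per_hour) )); }
```

heartbeats_per_hour prints 720 and bytes_per_heartbeat prints 1388; since the same budget also covers status updates, actual heartbeats must be smaller than that average.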
Monitoring the Agent
Check agent status:
# View agent status
./nodeagent status
# Check systemd service
sudo systemctl status nodeagent
# View recent logs
sudo tail -f /var/log/runos/nodeagent.log
# Verify VPN connectivity
sudo wg show
Healthy indicators:
- Agent status shows "Running"
- Recent heartbeat within the last 5 seconds
- VPN interfaces up with peers connected
- No repeated reconnection attempts in logs
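The heartbeat thresholds on this page (5-second interval, offline after 30 seconds) can be expressed as a simple classifier over the age of the last heartbeat. This is an illustrative sketch: the intermediate "late" state is a hypothetical label, not something the Console reports, and the Console's actual offline logic may differ.

```shell
HEARTBEAT_INTERVAL=5   # seconds between heartbeats
OFFLINE_AFTER=30       # seconds without a heartbeat before the server shows offline

# $1 = seconds since the most recent heartbeat
heartbeat_state() {
  local age=$1
  if [ "$age" -le "$HEARTBEAT_INTERVAL" ]; then
    echo healthy    # a heartbeat arrived within the last interval
  elif [ "$age" -le "$OFFLINE_AFTER" ]; then
    echo late       # beats are being missed, but not yet past the offline threshold
  else
    echo offline    # the Console would show the server as offline
  fi
}
```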
Node Agent Commands
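The Node Agent provides several commands for managing your server. Commands that change the system (register, agent, preflight, install, sync vpn, and certificate renew) require root privileges; status, logs, and version are run without sudo in the examples below.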
Common Commands:
# Register node with RunOS (one-time setup)
sudo ./nodeagent register --token <TOKEN> --aid <ACCOUNT_ID> --control-plane <0|1>
# Start the agent daemon (usually run via systemd)
sudo ./nodeagent agent
# Check system readiness before installation
sudo ./nodeagent preflight
# Install Kubernetes and join cluster (automated during setup)
sudo ./nodeagent install
# Synchronize VPN peers manually
sudo ./nodeagent sync vpn
# View agent status
./nodeagent status
# View formatted logs
./nodeagent logs
# Renew mTLS certificate
sudo ./nodeagent certificate renew
# Show version information
./nodeagent version
Registration Example:
When adding a server, you'll use the registration command with the token from the Console:
# Register as a worker node
sudo ./nodeagent register --token abc123xyz789 --aid acct_5678 --control-plane 0
# Register as a control plane node
sudo ./nodeagent register --token abc123xyz789 --aid acct_5678 --control-plane 1
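If you script node registration, validating the arguments first avoids registering a node with the wrong role. The hypothetical register_cmd helper below only checks its inputs and prints the register command shown above; it does not run anything.

```shell
# Print the register command after basic validation (hypothetical helper).
# $1 = registration token, $2 = account ID, $3 = 0 (worker) or 1 (control plane)
register_cmd() {
  local token=$1 aid=$2 role=$3
  case "$role" in
    0|1) ;;
    *) echo "control-plane must be 0 or 1" >&2; return 1 ;;
  esac
  if [ -z "$token" ] || [ -z "$aid" ]; then
    echo "token and account ID are required" >&2
    return 1
  fi
  echo "sudo ./nodeagent register --token $token --aid $aid --control-plane $role"
}
```

For example, register_cmd abc123xyz789 acct_5678 0 prints the worker registration command; review it, then run it yourself.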
Common Operations:
Most day-to-day operations don't require manual commands since the agent runs as a systemd service. However, you might use:
- sync vpn: If VPN connectivity is lost and needs a manual resync
- status: To check agent health and connection status
- logs: To view formatted agent logs for troubleshooting
Configuration File:
After registration, configuration is stored in /etc/runos/config.yaml. This includes:
- Node identifier (nid)
- Account identifier (aid)
- Server endpoints
- Network settings
You should not need to modify this file manually.
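For scripts that only need to read an identifier, a small reader for top-level YAML scalars is enough. This sketch assumes nid and aid are stored as top-level keys in /etc/runos/config.yaml; the real layout may differ, and yaml_get is a hypothetical helper, not part of nodeagent. It only reads the file.

```shell
# Print the value of a top-level "key: value" line from a simple YAML file.
# $1 = key, $2 = file
yaml_get() {
  local key=$1 file=$2
  sed -n "s/^${key}:[[:space:]]*//p" "$file" \
    | head -n 1 \
    | sed -e 's/[[:space:]]*$//' -e 's/^["'\'']//' -e 's/["'\'']$//'   # trim spaces, then quotes
}
```

Usage: sudo bash -c 'source helpers.sh; yaml_get aid /etc/runos/config.yaml', where helpers.sh is wherever you saved the function.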
Cluster Agent
The Cluster Agent runs inside your Kubernetes cluster (in the runos namespace) and handles cluster-level operations, primarily automatic SSL/TLS certificate management.
What It Does
Automatic SSL Certificate Management
The Cluster Agent's primary responsibility is managing the cluster domain certificate.
One-Time Certificate Request:
When your cluster is first created, the Cluster Agent works with cert-manager to obtain a wildcard SSL certificate from Let's Encrypt for your cluster's subdomain (e.g., *.your-cluster.runos.io). This happens once during cluster initialization:
- cert-manager requests a wildcard certificate from Let's Encrypt
- Let's Encrypt issues a DNS-01 challenge, which requires publishing a specific TXT record to prove domain ownership
- Cluster Agent receives the challenge and coordinates with RunOS to update DNS
- Let's Encrypt verifies the DNS record and issues the wildcard certificate
- Certificate is stored in Kubernetes as a secret
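The DNS record involved follows the ACME standard (RFC 8555): a TXT record at _acme-challenge.<domain>, with the "*." of a wildcard dropped, whose value is the unpadded base64url SHA-256 digest of the key authorization (the challenge token joined to the account key thumbprint). cert-manager and RunOS handle this for you; the sketch, with hypothetical helper names, just makes the record explicit.

```shell
# Name of the TXT record for a (possibly wildcard) domain.
challenge_record_name() {
  local domain=${1#\*.}     # *.your-cluster.runos.io -> your-cluster.runos.io
  echo "_acme-challenge.${domain}"
}

# TXT value: base64url(SHA-256(key authorization)) without "=" padding.
# $1 = key authorization string supplied by the ACME client; requires openssl.
challenge_txt_value() {
  printf '%s' "$1" | openssl dgst -sha256 -binary | openssl base64 -A | tr '+/' '-_' | tr -d '='
}
```

While a challenge is in progress you can watch the record with dig +short TXT _acme-challenge.your-cluster.runos.io.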
Certificate Usage:
Once obtained, the wildcard certificate:
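- Covers every name one level below your cluster subdomain (e.g., app.your-cluster.runos.io); a wildcard matches a single DNS label, so deeper names such as api.app.your-cluster.runos.io are not covered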
- Is stored securely in the cluster
- Gets copied to services and applications that need SSL/TLS
- Is automatically renewed 30 days before expiration
- Requires no manual management
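To see when renewal will happen for a given certificate, subtract the 30-day window from its expiry (notAfter). The sketch works in Unix epoch seconds and uses hypothetical function names; the pipeline in the comment is one way to read a real expiry date, with the secret name and namespace left as placeholders.

```shell
RENEW_BEFORE_DAYS=30   # renewal window stated on this page

# $1 = certificate expiry (notAfter) in epoch seconds
renewal_epoch() {
  echo $(( $1 - RENEW_BEFORE_DAYS * 86400 ))
}

# Whole days until renewal is due (negative once it is past due).
# $1 = current time, $2 = expiry, both in epoch seconds
days_until_renewal() {
  echo $(( ( $(renewal_epoch "$2") - $1 ) / 86400 ))
}

# Reading a real expiry date (placeholders in angle brackets):
#   kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | openssl x509 -noout -enddate
```

For a 90-day certificate issued at epoch 0 (expiry 7776000), renewal_epoch prints 5184000, i.e. renewal falls 60 days after issuance.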
What This Means:
Every service and application you deploy automatically gets SSL/TLS because they can use the stored wildcard certificate. You never need to request individual certificates - deploy as many services as you want and they all get instant HTTPS.
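Wildcard matching follows a simple rule, as browsers and TLS libraries apply it (RFC 6125): the "*" stands for exactly one DNS label. The hypothetical wildcard_covers function sketches that rule so you can check which hostnames the cluster certificate will serve.

```shell
# Succeeds if hostname $2 is matched by wildcard name $1 (e.g. "*.your-cluster.runos.io").
wildcard_covers() {
  local suffix=${1#\*.} host=$2 label
  case "$host" in
    *".$suffix") ;;           # host must end with ".<suffix>"
    *) return 1 ;;
  esac
  label=${host%".$suffix"}    # the part that "*" has to stand for
  case "$label" in
    ""|*.*) return 1 ;;       # empty, or more than one label
    *) return 0 ;;
  esac
}
```

app.your-cluster.runos.io is covered; api.app.your-cluster.runos.io and the bare your-cluster.runos.io are not matched by the wildcard entry itself.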
Benefits:
- One certificate covers unlimited applications
- Instant SSL for all new deployments
- Automatic renewal handled by the agent
- No certificate management overhead
Secure Cluster Operations
The agent performs authorized operations within your cluster:
- Managing Kubernetes secrets for platform integration
- Viewing pod status and deployment information
- Coordinating with cert-manager for certificate operations
- Maintaining persistent connection to RunOS backend
How It Connects
Initial Bootstrap
When first deployed, the agent temporarily uses the node's certificate for authentication, then receives its own dedicated mTLS certificate from RunOS backend. After that, it operates with its own credentials.
Persistent Connection
- Maintains always-on bidirectional connection to RunOS backend
- All traffic uses mTLS encryption
- Automatically reconnects if connection drops
- Regular heartbeat messages confirm connection health
This persistent connection allows real-time cluster health monitoring and coordinated certificate renewals.
Security and Permissions
What the Agent Can Do:
- Read and write secrets in the runos namespace
- View pod information across the cluster
- Interact with cert-manager for certificates
- Manage certificate-related Kubernetes resources
What the Agent Cannot Do:
- Modify resources outside the runos namespace without authorization
- Access your application data or user information
- Make infrastructure changes without explicit authorization
All operations are authenticated via mTLS and, when Kubernetes audit logging is enabled, can be traced through the audit logs.
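You can check the agent's effective permissions with kubectl auth can-i and user impersonation (--as), which requires that your own account is allowed to impersonate service accounts. The sketch prints a few such checks; the service account name runos-cluster-agent is an assumption, so confirm it with the command in the comment first.

```shell
# Confirm the real service account name first:
#   kubectl get deployment runos-cluster-agent -n runos \
#     -o jsonpath='{.spec.template.spec.serviceAccountName}'
SA_NAMESPACE=runos
SA_NAME=runos-cluster-agent   # hypothetical; replace with the name found above

# Kubernetes username for a service account.
sa_user() { echo "system:serviceaccount:${SA_NAMESPACE}:${SA_NAME}"; }

# Print permission checks matching the lists above.
can_i_checks() {
  local as
  as=$(sa_user)
  echo "kubectl auth can-i get secrets -n runos --as=$as"
  echo "kubectl auth can-i create secrets -n runos --as=$as"
  echo "kubectl auth can-i list pods --all-namespaces --as=$as"
  echo "kubectl auth can-i delete deployments -n default --as=$as"
}
```

Run the printed commands; each answers yes or no. If the permissions match the description above, the last one should answer no.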
Resource Footprint
The Cluster Agent is lightweight:
- CPU: ~100m (limit: 200m)
- Memory: ~128MB (limit: 256MB)
- Network: Minimal bandwidth for heartbeats and operations
- Single Instance: Only one agent pod runs per cluster
Monitoring the Agent
Check agent status:
# View agent pod
kubectl get pods -n runos
# Check agent logs
kubectl logs -n runos deployment/runos-cluster-agent
# Verify health
kubectl describe pod -n runos -l app=runos-cluster-agent
Normal operation shows:
- Successful connection to RunOS backend
- Regular heartbeat responses
- Certificate challenge handling (during issuance/renewal)
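To script the pod check, you can read pod phases with kubectl's JSONPath output and test them. The all_running helper is a hypothetical sketch that relies on the app=runos-cluster-agent label used in the describe command above.

```shell
# Succeeds only if at least one phase is given and every phase is "Running".
# Example input:
#   all_running $(kubectl get pods -n runos -l app=runos-cluster-agent \
#                   -o jsonpath='{.items[*].status.phase}')
all_running() {
  local phase
  [ "$#" -gt 0 ] || return 1           # no pods found counts as unhealthy
  for phase in "$@"; do
    [ "$phase" = "Running" ] || return 1
  done
}
```

A Running phase means the containers have started, not that the agent is connected to the backend; the logs remain the place to confirm that.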
How Agents Work Together
When you deploy an application through RunOS:
- You trigger deployment in the Console
- Console sends command to RunOS backend
- Backend routes to Node Agent on the appropriate server
- Node Agent executes Kubernetes deployment commands
- Kubernetes starts your application in containers
- Your app's domain is secured with the cluster's wildcard SSL certificate, which the Cluster Agent manages
- Status flows back: Node Agent → Backend → Console
- You see the deployment succeed with secure HTTPS endpoint
This flow typically completes within seconds (image pulls and application startup can add time), with both agents coordinating automatically to deliver a fully configured, secure deployment.
Agent Communication Security
Both agents use multiple layers of security:
Certificate-Based Authentication:
- Agents present client certificates for all operations
- Server validates certificates before accepting commands
- Certificates can be revoked if compromised
Encrypted Communication:
- All traffic uses TLS/mTLS encryption
- VPN traffic uses WireGuard encryption
- No plaintext credentials in configuration
Minimal Attack Surface:
- No inbound network ports opened
- All connections initiated outbound
- No remote shell access provided
- Limited to authorized operations only
Best Practices
For Node Agents:
- Run as systemd service for automatic restart
- Monitor logs periodically for warnings
- Keep certificates valid and check expiration dates
- Maintain network connectivity to RunOS platform
- Don't modify configuration files manually unless necessary
For Cluster Agents:
- Don't modify agent deployment or service account permissions
- Monitor logs to ensure normal operation
- Maintain network connectivity to RunOS backend
- Keep cert-manager installed and properly configured
- Report unusual behavior to RunOS support
Troubleshooting Agents
Node Agent Issues:
- Server appears offline in Console
- Deployment commands not received
- VPN connectivity lost
Quick checks:
./nodeagent status
sudo systemctl status nodeagent
sudo wg show
sudo ./nodeagent sync vpn
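A node that lists fewer wg0 peers than expected is the situation sync vpn is meant to fix. In a full mesh each node should list N-1 peers; the sketch counts the peer: lines in wg show output and compares. It assumes wg0 contains only cluster nodes as peers, and the helper names are hypothetical.

```shell
# Count peers in `wg show <iface>` output read from stdin
# (each peer block starts with a line beginning "peer: ").
count_peers() {
  grep -c '^peer: '
}

# $1 = nodes in the cluster, $2 = peers observed on this node's wg0
check_wg0_mesh() {
  local expected=$(( $1 - 1 ))
  if [ "$2" -lt "$expected" ]; then
    echo "missing $(( expected - $2 )) peer(s); try: sudo ./nodeagent sync vpn"
    return 1
  fi
  echo "ok"
}
```

Usage on a node in a 5-node cluster: check_wg0_mesh 5 "$(sudo wg show wg0 | count_peers)".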
Cluster Agent Issues:
- Certificates not being issued or renewed
- Agent pod not starting
Quick checks:
kubectl get pods -n runos
kubectl logs -n runos deployment/runos-cluster-agent
kubectl get certificaterequest -A
For detailed troubleshooting, contact RunOS support with relevant logs and error messages.
Cluster Autonomy and Agent Dependencies
Your Cluster is Independent
Important: Your Kubernetes cluster does NOT depend on RunOS or its agents to continue running. The cluster is fully autonomous once configured.
What happens if RunOS backend goes down:
- Your cluster continues running normally
- All applications and services remain operational
- Kubernetes manages workloads independently
- Existing VPN connections remain active
- No impact on cluster stability or performance
What happens if agents are stopped:
- Cluster continues operating without interruption
- Applications keep running as deployed
- Kubernetes continues managing pods and services
- You can still use kubectl and other tools directly
The agents are operational tools, not critical dependencies. They facilitate management but aren't required for cluster operation.
Agent Responsibilities
The agents only handle specific operational tasks:
Node Agent (runos binary) handles:
- Executing commands sent from the Console (user-initiated actions)
- Keeping control plane nodes synchronized in the cluster mesh
- Updating VPN peer configurations (wg0) when nodes join/leave
- Sending heartbeat and health metrics to the Console
Cluster Agent handles:
- Automatic SSL certificate renewal coordination
- Cluster-level operations triggered from the Console
- Maintaining cluster metadata in the RunOS platform
What continues without agents:
- All running applications and services
- Kubernetes cluster management and orchestration
- Pod scheduling and autoscaling
- Service networking and load balancing
- Storage management and volumes
- Existing VPN connections (wg0 and wg1)
Removing RunOS Agents
You can completely remove RunOS agents and even delete your RunOS account after cluster setup. Your cluster will continue functioning independently.
Safe removal process:
If you decide to remove the agents, stop them in close succession across all nodes:
# Stop agents on all nodes within a short timeframe
sudo systemctl stop runos
Why close succession matters:
The Node Agent manages control plane node discovery and VPN peer synchronization (wg0). If you stop agents gradually:
- A control plane node might miss heartbeats
- Other nodes may see it as offline
- The node could be removed from dnsmasq configuration
- This creates temporary inconsistencies in cluster DNS
Stopping all agents within a few minutes prevents these race conditions.
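One way to keep the stops close together is to issue them in parallel and wait for all of them. The sketch assumes SSH access with passwordless sudo on each node and uses the runos unit name from the command above; stop_all_agents and the host names are hypothetical.

```shell
UNIT=runos   # unit name used in the stop command above

# $1 = command used to reach a host (normally ssh); remaining args = hosts
stop_all_agents() {
  local runner=$1 host
  shift
  for host in "$@"; do
    "$runner" "$host" sudo systemctl stop "$UNIT" &   # start every stop without waiting
  done
  wait   # block until all stops return (exit status of individual stops is not collected)
}
```

Usage: stop_all_agents ssh node-1 node-2 node-3. Because wait without arguments returns success even if one stop failed, confirm afterwards with systemctl status on each node.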
After removing agents:
- Cluster operates independently
- You manage all operations via kubectl
- No Console connectivity or automation
- Manual VPN peer management if nodes are added/removed
- Manual SSL certificate management
- No centralized monitoring or logging integration
When You Need the Agents
The agents are valuable for:
- Automated operations: Deploy services via Console without kubectl
- Cluster expansion: Add nodes seamlessly with automatic configuration
- VPN management: Automatic peer synchronization as cluster changes (wg0)
- SSL automation: Wildcard certificate provisioning and renewal
- Monitoring: Centralized visibility into cluster health
- Remote access: Secure access without complex firewall rules
Reconnecting Agents
If you stop agents temporarily and want to reconnect:
# Restart the node agent
sudo systemctl start runos
# Verify connection
./runos status
The agent will automatically:
- Reconnect to RunOS backend
- Resume sending heartbeats
- Sync VPN peer configurations
- Report current cluster state
No reconfiguration needed - the agent uses stored credentials from /etc/runos/config.yaml.
What You Should Know
OSID (Open Service Identifier)
Every service and application deployed in RunOS has a unique identifier called an OSID, in the format service-name-xxxxx (e.g., mysql-d6ekr).
Each OSID corresponds to a Kubernetes namespace where all related resources are deployed. When troubleshooting at the Kubernetes level, find the OSID in the RunOS Console and use it as the namespace:
# View all resources for a service
kubectl get all -n mysql-d6ekr
# Check pod logs
kubectl logs -n myapp-t7r4s <pod-name>
This organization keeps each service isolated and makes troubleshooting straightforward.
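If you script against OSIDs, the service name and suffix can be split with parameter expansion. The format check assumes the suffix is always five lowercase letters or digits, which matches the examples on this page but is not a documented guarantee; the helper names are hypothetical.

```shell
# Succeeds if $1 looks like service-name-xxxxx (lowercase, five-character suffix).
is_osid() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?-[a-z0-9]{5}$'
}

osid_service() { echo "${1%-*}"; }    # drop the final "-xxxxx" suffix
osid_suffix()  { echo "${1##*-}"; }   # keep only the suffix
```

For example, once is_osid "$osid" succeeds, kubectl get all -n "$osid" lists that service's resources.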