System Architecture

RunOS is designed to abstract away Kubernetes complexity while leveraging its reliability and power. This document explains how the system works, how components communicate, and what makes RunOS different from traditional Kubernetes deployments.

What RunOS Provides

RunOS gives you all the benefits of Kubernetes without the operational burden:

One-click service deployments - PostgreSQL, Redis, Kafka, and 20+ services deploy instantly
Git-based application deployment - Push code, RunOS handles builds, deployments, domains, and SSL
Infrastructure flexibility - Run on any cloud provider, bare-metal, or on-premises
Automatic management - Kubernetes clusters created and managed behind the scenes
Service intelligence - Applications automatically discover and connect to services

Core Components

Console (Web Application)

Your primary interface for managing everything in RunOS.

What you do here:

Add and manage servers
Deploy applications from Git repositories
Install services like databases and message queues
Monitor cluster health and performance
View logs and metrics
Configure domains and SSL certificates

Think of the Console as mission control - where you make decisions and see results.

Node Agent (On Your Servers)

Lightweight daemon running on each server in your cluster.

Responsibilities:

Prepares servers during initial setup
Maintains secure connections to RunOS platform
Manages VPN connectivity between servers
Executes deployment commands
Reports health status and metrics

When you add a server through the Console, the Node Agent handles all technical setup automatically.

Cluster Agent (In Kubernetes)

Service running inside your cluster in the runos namespace.

Responsibilities:

Automatically provisions SSL/TLS certificates
Renews certificates before expiration
Maintains secure communication with RunOS platform
Handles cluster-level administrative tasks

Works quietly in the background to keep your cluster secure and operational.

RunOS Backend

Central communication hub connecting everything together.

What it does:

Routes commands from Console to Node Agents
Collects status updates and metrics from servers
Manages secure mTLS communication channels
Handles authentication and authorization

You don't interact with it directly - it's the invisible messenger making everything work seamlessly.

Templates Service

Provides configurations for all deployable services.

What it provides:

Kubernetes configuration templates for supported services
Intelligent defaults for service deployments
Tiered configuration options (lightweight, HA, enterprise)
Best-practice configurations

When you deploy PostgreSQL or any service, Templates provides the battle-tested configuration.

How Components Communicate

Secure Communication

RunOS uses encrypted communication with certificate-based authentication throughout the platform.

Initial Registration

Used when new servers join your cluster
HTTPS/TLS encrypted communication
Token-based authentication
One-time setup process

Ongoing Operations

Mutual TLS (mTLS) authentication
Both client and server verify certificates
Used for all Console ↔ Backend ↔ Agent communication
Prevents man-in-the-middle attacks
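
To make "both client and server verify certificates" concrete, here is a minimal sketch using Python's standard ssl module. It is not the RunOS implementation; the function names and the optional file-path parameters are hypothetical and only illustrate how each side of an mTLS connection is configured.

```python
import ssl
from typing import Optional


def server_context(cert: Optional[str] = None, key: Optional[str] = None,
                   client_ca: Optional[str] = None) -> ssl.SSLContext:
    """Server side of mTLS: present a certificate and demand one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the handshake fail if the client sends no valid certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)  # hypothetical paths
    if client_ca:
        ctx.load_verify_locations(cafile=client_ca)      # CA that signed client certs
    return ctx


def client_context(cert: Optional[str] = None, key: Optional[str] = None,
                   server_ca: Optional[str] = None) -> ssl.SSLContext:
    """Client side of mTLS: verify the server and present our own certificate."""
    # The default client context already verifies the server certificate and hostname.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=server_ca)
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)  # hypothetical paths
    return ctx
```

With plain TLS only the client context verifies its peer; setting CERT_REQUIRED on the server side is what makes the authentication mutual.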

VPN Layer

WireGuard encryption for all inter-server traffic
Modern cryptography (ChaCha20, Poly1305, Curve25519)
Typically faster than IPsec or OpenVPN, with a much smaller codebase to audit
Perfect forward secrecy

Communication Flows

When You Deploy an Application:

You (Web Browser)
    ↓ HTTPS
Console Frontend
    ↓ REST API
Console Backend
    ↓ Authentication check
Validated request
    ↓ mTLS encrypted connection
RunOS Backend
    ↓ Routes to appropriate server(s)
Node Agent(s)
    ↓ Executes kubectl commands
Kubernetes creates pods
    ↓ Status flows back
Console shows success

Timeline: Typically 5-15 seconds from clicking "Deploy" to seeing your app running.

Server Health Monitoring:

Node Agent (every 5 seconds)
    ↓ Heartbeat via mTLS
RunOS Backend
    ↓ Updates internal state
Stores latest status
    ↓ When Console requests
Returns current status
    ↓ REST API response
Console displays real-time health

SSL Certificate Issuance:

cert-manager (in cluster)
    ↓ Requests certificate
Let's Encrypt
    ↓ DNS-01 challenge
Cluster Agent (webhook)
    ↓ mTLS connection
RunOS Backend
    ↓ Updates DNS
Let's Encrypt verifies
    ↓ Issues certificate
Available to your apps

Timeline: Usually 1-2 minutes for initial issuance; renewals happen automatically in the background.
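
For readers curious what the "Updates DNS" step publishes, the sketch below computes the TXT record for a DNS-01 challenge as defined by the ACME standard (RFC 8555). In the flow above this work is done by cert-manager, Let's Encrypt, and the RunOS Backend; the function name and the sample inputs here are hypothetical.

```python
import base64
import hashlib
from typing import Tuple


def dns01_txt_record(domain: str, key_authorization: str) -> Tuple[str, str]:
    """Return the (record name, TXT value) that proves control of a domain."""
    # The TXT value is the unpadded base64url encoding of SHA-256(key authorization).
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    # Wildcard certificates are validated on the base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base, value


# Hypothetical inputs: a real key authorization is "<token>.<account key thumbprint>".
name, value = dns01_txt_record("app.example.com", "sample-token.sample-thumbprint")
```

Let's Encrypt then queries the returned record name and compares the TXT value with its own computation before issuing the certificate.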

Server-to-Server Communication

All servers in your cluster are connected via a WireGuard VPN, creating an encrypted mesh network:

Server A (172.24.1.10)  ←→  WireGuard  ←→  Server B (172.24.1.20)
       ↓                                            ↓
   Pod Network                                 Pod Network
   (172.25.1.0/24)                             (172.25.2.0/24)

Why VPN:

Works behind NAT without public IPs
Encrypts all inter-server traffic
Simplifies network configuration
Provides stable IP addresses for nodes

Network Architecture:

wg0 (172.24.0.0/16) - Kubernetes internal traffic
wg1 (172.24.200.0/21) - User access network
Pod network (172.25.0.0/16) - Container IPs
Service network (10.96.0.0/12) - Kubernetes service IPs

Data Flow Examples

Deploying a Service (PostgreSQL)

You click "Deploy PostgreSQL" in Console
Console validates cluster has sufficient resources
Templates provides PostgreSQL Kubernetes manifests
RunOS Backend sends deployment instructions via mTLS to Node Agents
Node Agents execute Kubernetes deployment
Kubernetes starts PostgreSQL pods
Cluster Agent provisions SSL certificate
Status flows back: Agents → Backend → Console
You see PostgreSQL running with connection details

The deployment itself typically completes in seconds, fully configured and secure; initial certificate issuance can take 1-2 minutes.

Deploying an Application

You point to your Git repository
Console builds container image (GitHub Actions or in-cluster BuildKit)
Image pushed to local Harbor registry
Console sends deployment command via RunOS Backend
Node Agents execute Kubernetes deployment
Kubernetes pulls image from Harbor and starts pods
Cluster Agent provisions SSL certificate for your domain
Traefik ingress routes traffic to your application
Console displays application URL and status

Service Discovery

You deploy an application that needs PostgreSQL
During deployment, RunOS lists available PostgreSQL instances
You select a compatible instance
RunOS injects connection credentials as environment variables
Your application automatically connects using those variables
Kubernetes internal DNS routes traffic between services
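
On the application side, consuming injected credentials might look like the sketch below. The environment variable names (POSTGRES_HOST, POSTGRES_PORT, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB) are assumptions for illustration; check the Console for the names RunOS actually injects.

```python
import os
from typing import Mapping
from urllib.parse import quote


def postgres_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Build a PostgreSQL connection URL from injected environment variables."""
    # Hypothetical variable names; a missing required variable raises KeyError.
    host = env["POSTGRES_HOST"]
    port = env.get("POSTGRES_PORT", "5432")
    user = env["POSTGRES_USER"]
    password = env["POSTGRES_PASSWORD"]
    database = env.get("POSTGRES_DB", user)
    # Percent-encode credentials so characters like "@" or "/" cannot break the URL.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, quote(database, safe="")
    )
```

Reading credentials from the environment keeps them out of the Git repository, and failing fast on a missing variable surfaces configuration problems at startup rather than at the first query.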

Understanding Key Concepts

Kubernetes (Hidden But Present)

You don't need to know Kubernetes to use RunOS, but understanding a few concepts helps:

Pods - Running instances of your applications or services
Services - Network endpoints that route traffic to pods
Ingress - Routes external traffic to your services
Persistent Volumes - Storage that persists even if pods restart
Namespaces - Isolated environments for resources

The Console shows these in user-friendly terms, but knowing the underlying primitives helps with advanced troubleshooting.

OSID (Open Service Identifier)

Every service and application in RunOS has a unique identifier in the format: service-name-xxxxx

Examples:

mysql-d6ekr - A MySQL database instance
postgres-k9m3w - A PostgreSQL database
myapp-t7r4s - A custom application

Why it matters: Each OSID is also the name of a Kubernetes namespace that contains all resources for that service or application. When you need to troubleshoot at the Kubernetes level:

# View all resources for a service
kubectl get all -n mysql-d6ekr

# Check pods for your application
kubectl get pods -n myapp-t7r4s

# View logs
kubectl logs -n postgres-k9m3w <pod-name>

This organization keeps services isolated and makes troubleshooting straightforward.
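
If you script against OSIDs, a small validator can catch typos before they reach kubectl. The sketch below assumes the suffix is exactly five lowercase letters or digits, which is inferred from the examples above rather than from a published specification; parse_osid and Osid are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumption: "<service-name>-<5 lowercase alphanumerics>", inferred from the examples.
OSID_PATTERN = re.compile(r"(?P<name>[a-z0-9]+(?:-[a-z0-9]+)*)-(?P<suffix>[a-z0-9]{5})")


class Osid(NamedTuple):
    name: str    # service or application name, e.g. "mysql"
    suffix: str  # unique suffix, e.g. "d6ekr"


def parse_osid(value: str) -> Optional[Osid]:
    """Split an OSID into name and suffix, or return None if it does not match."""
    match = OSID_PATTERN.fullmatch(value)
    return Osid(match["name"], match["suffix"]) if match else None


def kubectl_get_all(osid: str) -> list:
    """Build the kubectl argument list for inspecting an OSID's namespace."""
    if parse_osid(osid) is None:
        raise ValueError("not a valid OSID: " + repr(osid))
    return ["kubectl", "get", "all", "-n", osid]
```

Because the OSID doubles as the namespace, the validated string can be passed straight to the -n flag.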

Your Infrastructure, Your Control

RunOS runs on servers you provide and control:

You choose where servers are located (cloud, on-premises, hybrid)
You maintain physical or virtual machine access
You can access Kubernetes directly if needed (kubectl works)
All your data stays on your infrastructure

RunOS manages the Kubernetes control plane and workloads, but you retain root access to your servers.

Network Architecture

IP Addressing

Each server receives multiple IP addresses:

Physical network IP - Your server's actual network interface
wg0 VPN IP - Kubernetes internal (172.24.X.X)
wg1 VPN IP - User access (172.24.200.X)
Pod CIDR - Range for pods on this node (172.25.X.X/24)
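
When troubleshooting, it helps to know which network an address belongs to. The following sketch encodes the ranges listed in this document with Python's ipaddress module; the classify function and its labels are hypothetical helpers, not part of RunOS.

```python
import ipaddress

# Ranges as documented above.
WG0 = ipaddress.ip_network("172.24.0.0/16")       # Kubernetes internal traffic
WG1 = ipaddress.ip_network("172.24.200.0/21")     # user access network
PODS = ipaddress.ip_network("172.25.0.0/16")      # container IPs
SERVICES = ipaddress.ip_network("10.96.0.0/12")   # Kubernetes service IPs


def classify(address: str) -> str:
    """Name the documented network that contains the given IPv4 address."""
    ip = ipaddress.ip_address(address)
    # wg1 is checked first because its /21 lies inside the wg0 /16 as documented.
    if ip in WG1:
        return "wg1 (user access)"
    if ip in WG0:
        return "wg0 (Kubernetes internal)"
    if ip in PODS:
        return "pod network"
    if ip in SERVICES:
        return "service network"
    return "outside RunOS ranges"


# Each node receives one /24 slice of the pod network.
node_pod_cidrs = list(PODS.subnets(new_prefix=24))
```

Splitting the /16 pod network into /24 blocks yields 256 per-node ranges of 256 addresses each, for example 172.25.1.0/24 and 172.25.2.0/24.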

DNS Resolution

Application queries database.default.svc.cluster.local
    ↓
systemd-resolved
    ↓
dnsmasq (on wg0 VPN IP)
    ↓
CoreDNS (Kubernetes DNS)
    ↓
Returns ClusterIP (10.96.X.X)
    ↓
kube-proxy routes to pod
    ↓
PostgreSQL Pod (172.25.X.X)

External domains resolve through dnsmasq → Cloudflare/Google DNS.

Traffic Routing

Pod-to-Pod (Same Node):

Pod A → Container bridge → Pod B

Pod-to-Pod (Different Nodes):

Pod A → Cilium CNI → wg0 VPN → Encrypted tunnel → Target Node → Pod B

All inter-node traffic flows through encrypted VPN tunnels automatically.

Connection Reliability

Automatic Reconnection

If connections are interrupted, agents automatically reconnect:

Connection lost detected
Wait with exponential backoff (starts at 1 second)
Attempt reconnection
If successful, resume normal operations
If failed, increase wait time (max 60 seconds)

Why exponential backoff:

Prevents connection storms
Gives network time to recover
Reduces load on control plane
Handles temporary outages gracefully
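
The reconnection steps above can be sketched as follows. The 1-second start and 60-second cap come from this document; the doubling factor, the attempt limit, and the function names are assumptions, since the agent's exact schedule is not specified here.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 1.0, maximum: float = 60.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield wait times that grow by `factor` until they reach `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def reconnect(connect: Callable[[], bool], max_attempts: int = 10,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait, then try `connect`; repeat with longer waits until it succeeds."""
    for _, delay in zip(range(max_attempts), backoff_delays()):
        sleep(delay)       # wait with exponential backoff
        if connect():      # attempt reconnection
            return True    # success: resume normal operations
    return False           # gave up after max_attempts
```

With a doubling factor the waits run 1, 2, 4, 8, 16, 32 seconds and then stay at 60. Production implementations often add random jitter to each delay so that many agents do not retry in lockstep.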

Health Monitoring

Heartbeats sent every 5 seconds
Healthy: Heartbeats received within 10 seconds
Degraded: Heartbeats delayed 10-30 seconds
Offline: No heartbeat for 30+ seconds

If the network returns, agents reconnect automatically and resume normal operation.
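
The health states above amount to a threshold check on the time since the last heartbeat. This sketch shows one way to express it; how the boundaries at exactly 10 and 30 seconds are treated is an assumption, and node_status is a hypothetical name.

```python
HEALTHY_WITHIN_S = 10   # heartbeat received within 10 seconds
OFFLINE_AFTER_S = 30    # no heartbeat for 30+ seconds


def node_status(seconds_since_heartbeat: float) -> str:
    """Map the age of the last heartbeat to a health state."""
    if seconds_since_heartbeat < HEALTHY_WITHIN_S:
        return "healthy"
    if seconds_since_heartbeat < OFFLINE_AFTER_S:
        return "degraded"   # assumption: exactly 10 s counts as degraded
    return "offline"        # assumption: exactly 30 s counts as offline
```

With heartbeats every 5 seconds, a node can miss one heartbeat and still be reported healthy, and it is marked offline after roughly six consecutive misses.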

Firewall Requirements

Required Outbound Access

Your servers need outbound access to:

Service              Protocol           Purpose
runos.com            HTTPS (443)        Platform communication
get.runos.com        HTTPS (443)        Component downloads
Other cluster nodes  UDP (51820/51821)  VPN connectivity

Important: No inbound ports are required - all connections are initiated outbound from your servers.

Behind NAT

RunOS works behind NAT without special configuration:

All connections are outbound-initiated
NAT allows return traffic automatically
WireGuard uses UDP hole-punching for peer-to-peer connections
Persistent keepalive maintains NAT state

Supported scenarios:

Home networks behind residential NAT
Corporate networks behind a firewall
Cloud VPCs with private subnets
Hybrid setups across multiple networks

Security Architecture

Multiple Layers of Protection

Certificate-based authentication:

All agents present client certificates
Certificates validated before accepting commands
Certificates can be revoked if compromised

Encrypted communication:

TLS/mTLS for all platform communication
WireGuard VPN for inter-server traffic
No plaintext credentials stored

Minimal attack surface:

No inbound network ports opened
All connections initiated outbound
No remote shell access
Limited to authorized operations only

Secrets Management

All secrets encrypted at rest by default
Encryption handled transparently by Kubernetes
AES-CBC encryption with random keys
Access controlled by namespace
Never exposed in logs or UI

What Makes RunOS Different

Traditional Kubernetes:

Manual cluster setup and configuration
Complex networking and certificate management
Manual service deployments with YAML files
Requires deep Kubernetes expertise

RunOS:

Automatic cluster setup and configuration
Automatic networking and certificate management
One-click service deployments
Kubernetes expertise optional

You get all the benefits of Kubernetes (reliability, scalability, ecosystem) without the operational complexity.

Performance Characteristics

Network Overhead

Node Agent communication:

~1 MB/hour per node for heartbeats and status
Command traffic varies; each command is typically under 10 KB

VPN overhead:

WireGuard adds ~60 bytes per packet
Encryption/decryption very fast (negligible CPU)
Typical overhead: 1-5%

Network requirements:

Minimum: 128 Kbps per node
Recommended: 1 Mbps or higher per node
Latency tolerance: up to 500 ms is acceptable
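
As a back-of-the-envelope check, the figures above can be related to each other with a few lines of arithmetic. The sketch assumes a decimal megabyte and defines VPN overhead as the share of on-the-wire bytes added by WireGuard; both are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL_S = 5             # heartbeats sent every 5 seconds
TRAFFIC_BYTES_PER_HOUR = 1_000_000   # ~1 MB/hour per node (decimal MB assumed)
WIREGUARD_OVERHEAD_BYTES = 60        # ~60 bytes added per packet

heartbeats_per_hour = 3600 // HEARTBEAT_INTERVAL_S
bytes_per_heartbeat = TRAFFIC_BYTES_PER_HOUR / heartbeats_per_hour
average_kbps = TRAFFIC_BYTES_PER_HOUR * 8 / 3600 / 1000


def wireguard_overhead(payload_bytes: int) -> float:
    """Fraction of bytes on the wire that WireGuard adds for one packet."""
    return WIREGUARD_OVERHEAD_BYTES / (payload_bytes + WIREGUARD_OVERHEAD_BYTES)


print(heartbeats_per_hour, round(bytes_per_heartbeat), round(average_kbps, 2))
print(round(wireguard_overhead(1440) * 100, 1), "% overhead for a 1440-byte payload")
```

This works out to 720 heartbeats per hour at roughly 1.4 KB each, or about 2.2 Kbps on average, well below the 128 Kbps minimum. A 1440-byte payload carries 4% overhead, inside the 1-5% range quoted above; small packets carry proportionally more.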

Resource Usage

Per Server:

Node Agent: ~50-100MB RAM, less than 1% CPU
Cluster Agent (one per cluster): ~128MB RAM, ~100m CPU
VPN: Minimal overhead
Total platform overhead: less than 200MB RAM per server

This leaves the vast majority of server resources available for your applications.
