Observability
RunOS provides comprehensive observability for your cluster through three main tools: the Pod Viewer for real-time pod monitoring, Log Viewer for application logs, and an integrated metrics suite using Prometheus and Grafana for detailed performance insights.
Pod Viewer
The Pod Viewer gives you real-time visibility into all pods running in your cluster.
Accessing the Pod Viewer
Via Console:
- Click Pods in the main navigation
- View all pods across your cluster
- Filter by namespace (OSID), status, or node
From Service/Application Pages:
- Navigate to a specific service or application
- Click the Pods tab
- View pods for that service/application only
Understanding Pod Status
Healthy States:
- 🟢 Running - Pod operational and serving traffic
- ⚪ Completed - Job finished successfully
Transitional States:
- 🟡 Pending - Starting up, normal during deployment
- 🟡 ContainerCreating - Pulling images or creating containers
Problem States:
- 🔄 CrashLoopBackOff - Container repeatedly crashing
- 🔴 Failed - Pod could not start
- 🔴 Error - Configuration or system issue
Quick Actions
Restart Pod:
- Click the Restart button
- Kubernetes automatically recreates the pod
- Useful for recovering from errors
Note on Logs: The Pod Viewer shows pods and their statuses only; it does not display logs. To view logs:
- Navigate to the service management screen and click Logs tab
- Or install Vector from the Services marketplace for comprehensive historical log viewing (see below)
Filtering and Search
Filter by:
- Namespace (OSID) - Show pods for specific service/app
- Status - Show only running, failed, or pending pods
- Node - See all pods on a particular server
Search:
- Search by pod name, namespace, or node
- Partial matching supported
- Real-time filtering as you type
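The filter-and-search behaviour described above can be sketched in a few lines of Python. This is an illustrative model with made-up pod records, not the actual RunOS implementation:

```python
# Minimal sketch of the Pod Viewer's partial-match filter.
# The pod records below are hypothetical examples.
pods = [
    {"name": "mysql-prod-0", "namespace": "mysql-prod", "node": "node-1", "status": "Running"},
    {"name": "redis-cache-0", "namespace": "redis-cache", "node": "node-2", "status": "Pending"},
    {"name": "mysql-prod-1", "namespace": "mysql-prod", "node": "node-2", "status": "CrashLoopBackOff"},
]

def filter_pods(pods, query="", status=None):
    """Return pods whose name, namespace, or node contains `query`
    (case-insensitive partial match), optionally restricted to one status."""
    q = query.lower()
    return [
        p for p in pods
        if (q in p["name"].lower() or q in p["namespace"].lower() or q in p["node"].lower())
        and (status is None or p["status"] == status)
    ]

print([p["name"] for p in filter_pods(pods, "mysql")])
# ['mysql-prod-0', 'mysql-prod-1']
```

Combining a search term with a status filter narrows results the same way the console's filter bar does.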
Log Viewer
Access logs from all your services and applications through basic streaming or advanced log aggregation.
Basic Logs (Always Available)
View real-time logs from individual services without additional setup.
Accessing Basic Logs:
- Navigate to Services in the main navigation
- Click on the service you want to view logs for
- Click the Logs tab
- View streaming logs in real-time
Or via kubectl:
# View logs from specific pod
kubectl logs <pod-name> -n <OSID>
# Stream logs in real-time
kubectl logs <pod-name> -n <OSID> -f
# Last 100 lines
kubectl logs <pod-name> -n <OSID> --tail=100
# Previous container (if crashed)
kubectl logs <pod-name> -n <OSID> --previous
Basic Log Features:
- Real-time streaming with auto-scroll
- Search within visible content (Ctrl+F/Cmd+F)
- Download logs to file
- Tail control (last 50, 100, 500, 1000 lines)
- Timestamp display
Limitations:
- One pod at a time
- Limited retention (rotated based on disk space)
- No historical search
- No cross-pod correlation
Advanced Log Aggregation (Vector)
For centralized logs with full-text search across all pods, deploy Vector from the Services Marketplace. Vector uses ClickHouse as its storage backend, providing high-performance log storage and analytics capabilities.
What Vector Provides:
- Centralized log collection from all pods across the cluster
- Full-text search across entire cluster using ClickHouse
- Historical log storage with configurable retention
- Multi-pod log correlation
- Export capabilities for compliance and analysis
Installing Vector:
- Navigate to Services → Marketplace
- Search for "Vector"
- Click Deploy Vector
- Configure:
- Log retention (default: 7 days)
- Storage size (default: 10Gi)
- Storage class (OpenEBS or Longhorn)
- Click Deploy
Vector automatically deploys with ClickHouse as its database backend for storing and querying logs. After installation, a Logs menu item appears in the main navigation with full search capabilities.
For more details: See Vector Pipelines and ClickHouse in the Service Catalog.
Advanced Log Features:
Full-Text Search:
# Simple search
error
# Multiple terms (AND)
error database connection
# Phrase search
"connection refused"
# Field-specific search
namespace:mysql-prod AND level:error
# Time-based search
@timestamp:[2024-03-01 TO 2024-03-02] AND error
Filtering:
- Filter by namespace (OSID)
- Filter by pod name
- Filter by container
- Filter by log level (DEBUG, INFO, WARN, ERROR, FATAL)
- Filter by time range (last 5m, 1h, 24h, 7d, custom)
Log Correlation:
- Track requests across microservices by request ID
- View all logs in a specific time window
- Find all instances of specific error types
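The request-ID correlation above can be sketched in plain Python. The JSON log lines and their field names (`request_id`, `pod`, `level`) are illustrative, not a fixed RunOS schema:

```python
import json
from collections import defaultdict

# Hypothetical structured log lines, as a log aggregator might store them.
raw_logs = [
    '{"request_id": "abc123", "pod": "api-0", "level": "INFO", "msg": "request received"}',
    '{"request_id": "abc123", "pod": "worker-1", "level": "ERROR", "msg": "db timeout"}',
    '{"request_id": "def456", "pod": "api-1", "level": "INFO", "msg": "request received"}',
]

def correlate(lines):
    """Group parsed log entries by request_id so one request can be
    followed across every pod it touched."""
    grouped = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        grouped[entry["request_id"]].append(entry)
    return grouped

trace = correlate(raw_logs)["abc123"]
print([e["pod"] for e in trace])  # ['api-0', 'worker-1']
```

This is exactly why structured logs with a shared correlation ID matter: without the `request_id` field, the two halves of the failing request would be unrelated lines in two different pods.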
Export:
- Export search results to CSV, JSON, or plain text
- Bulk export for compliance or analysis
Log Best Practices
Effective Logging in Applications:
- Use structured logging (JSON format preferred)
- Include request ID, user ID, and correlation IDs
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR, FATAL)
- Never log passwords, secrets, or PII
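A minimal structured-logging setup using only the Python standard library might look like the sketch below; the field names are illustrative:

```python
import io
import json
import logging

# A JSON formatter so each log line is machine-parseable by an aggregator.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

stream = io.StringIO()  # stand-in for stdout, which Kubernetes captures
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("payment processed", extra={"request_id": "abc123"})
print(stream.getvalue().strip())
# {"level": "INFO", "msg": "payment processed", "request_id": "abc123"}
```

Writing one JSON object per line to stdout is enough: Kubernetes captures it, and an aggregator like Vector can index every field without extra parsing rules.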
Log Retention:
- Development: 1-3 days
- Staging: 7-14 days
- Production: 30-90 days (or more for compliance)
Avoid:
- Logging large payloads (truncate or summarize)
- Excessive DEBUG logging in production
- Logging in tight loops (causes storage issues)
Metrics and Monitoring
RunOS includes an integrated metrics suite using Prometheus for time-series data and Grafana for visualization.
What's Included
The metrics suite provides comprehensive insights into:
Service Metrics:
- Resource usage (CPU, memory, disk I/O)
- Request rates and latencies
- Error rates
- Connection counts
- Query performance (for databases)
Node Metrics:
- CPU utilization per node
- Memory usage and available capacity
- Disk usage and I/O statistics
- Network traffic
- System load
Application Metrics:
- Custom application metrics
- HTTP request metrics
- Response times
- Business metrics (if exposed)
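For Prometheus to scrape custom application metrics, the application serves them in the Prometheus text exposition format. The sketch below renders that format by hand to show its shape; in practice you would use a Prometheus client library, and the metric names here are examples:

```python
# Sketch of the Prometheus text exposition format an application serves
# at its /metrics endpoint. Metric names and values are illustrative.
def render_metrics(http_requests_total, memory_bytes):
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
        f'http_requests_total{{method="GET"}} {http_requests_total}',
        "# HELP process_resident_memory_bytes Resident memory size.",
        "# TYPE process_resident_memory_bytes gauge",
        f"process_resident_memory_bytes {memory_bytes}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(1027, 52428800))
```

Each metric gets a `# HELP` description, a `# TYPE` (counter, gauge, histogram, summary), and one sample line per label combination.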
Metrics Architecture
Prometheus:
- Time-series database for metrics
- Automatically scrapes metrics from all pods
- Stores metrics with configurable retention
- Provides query language (PromQL)
Grafana:
- Visualization and dashboarding
- Pre-configured dashboards for all services
- Customizable dashboards
- Alerting capabilities
Accessing Metrics
Via Console:
- Key metrics displayed on service/application detail pages
- Pod Viewer shows CPU and memory usage
- Dashboard shows cluster-wide metrics
Via Grafana:
- Navigate to Services in main menu
- Find Grafana in your deployed services
- Click to open Grafana interface
- Browse pre-configured dashboards
Pre-configured Dashboards:
- Cluster Overview - Node health and resource usage
- Kubernetes Pods - Pod metrics across cluster
- Node Exporter - Detailed node-level metrics
- Service-Specific - Dashboards for PostgreSQL, MySQL, Redis, etc.
Understanding Metrics
CPU Usage:
- Measured in cores (e.g., 0.5 cores = 50% of one CPU)
- Shows current usage vs. allocated resources
- Warning if approaching limits
Memory Usage:
- Measured in bytes (MB/GB)
- Shows current usage vs. allocated limits
- Critical if a pod exceeds its memory limit (the container is OOM-killed and restarted)
Network I/O:
- Bytes sent/received per second
- Connection counts
- Helps identify traffic patterns
Disk I/O:
- Read/write operations per second
- Latency metrics
- Important for database performance
Custom Dashboards
Create custom Grafana dashboards for your specific needs:
- Open Grafana from Services
- Click + → Dashboard
- Add panels with PromQL queries
- Save dashboard for team access
Example PromQL Queries:
# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])
# Memory usage by namespace
container_memory_usage_bytes{namespace="your-osid"}
# HTTP request rate
rate(http_requests_total[5m])
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
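To make the last query above less opaque: Prometheus histograms expose cumulative bucket counts, and `histogram_quantile` linearly interpolates inside the bucket where the target rank falls. A plain-Python sketch of that computation, with made-up bucket data:

```python
# What histogram_quantile(0.95, ...) computes, modelled on cumulative buckets.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 finished under 0.1s, 90 under 0.5s, all 100 under 1s.
print(histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)]))
# prints approximately 0.75
```

The interpolation is why the reported p95 depends on how your histogram buckets are laid out: coarse buckets give coarse quantiles.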
Metrics Retention
Default retention:
- Prometheus: 15 days
- Grafana: Configurable per dashboard
Storage requirements:
- Small cluster: 10-20Gi
- Medium cluster: 50-100Gi
- Large cluster: 200-500Gi
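A back-of-envelope way to sanity-check those sizes is the sizing formula from the Prometheus storage documentation: disk ≈ retention time × ingested samples per second × bytes per sample (roughly 1-2 bytes after compression). The sample rate below is an assumption for illustration:

```python
# Rough Prometheus disk sizing: retention_seconds * samples/s * bytes/sample.
def prometheus_disk_gib(samples_per_second, retention_days, bytes_per_sample=2):
    seconds = retention_days * 24 * 3600
    return samples_per_second * seconds * bytes_per_sample / 2**30

# e.g. 10,000 active series scraped every 15s ~= 667 samples/s,
# kept for the default 15-day retention:
print(round(prometheus_disk_gib(667, 15), 1))  # ~1.6 GiB
```

Real clusters run far more than 10,000 series once node, pod, and service metrics are counted, which is how the 10-500Gi ranges above arise.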
Alerting
Set up alerts based on metrics thresholds (requires Grafana configuration):
Common alerts:
- High CPU usage (above 80% for 5 minutes)
- High memory usage (above 90%)
- Pod restart rate (more than 3 in 10 minutes)
- Service down (no metrics received)
- Disk space low (less than 10% free)
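The "above 80% for 5 minutes" pattern deserves a note: the alert should fire only when the condition holds for the whole window, not on a single spike (this is what a `for: 5m` clause in a Prometheus alerting rule effectively enforces). A minimal sketch of that sustained-threshold check, with hypothetical per-minute CPU readings:

```python
# Fire only when every reading in the window breaches the threshold.
def sustained_breach(samples, threshold=0.80, window=5):
    """samples: one CPU-utilization reading (0.0-1.0) per minute, newest last."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

print(sustained_breach([0.55, 0.82, 0.85, 0.90, 0.88, 0.91]))  # True
print(sustained_breach([0.85, 0.85, 0.60, 0.90, 0.88, 0.91]))  # False: dipped mid-window
```

Requiring the full window suppresses noisy one-off spikes, at the cost of a few minutes of detection latency.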
Resource Usage at a Glance
Quickly assess resource usage across your cluster:
Node-Level:
- View all nodes and their CPU/memory utilization
- Identify overloaded nodes
- Plan capacity expansion
Service-Level:
- See resource usage per service
- Identify resource-hungry services
- Optimize resource allocations
Pod-Level:
- Individual pod metrics in Pod Viewer
- Compare pods to identify issues
- Troubleshoot performance problems
Common Observability Tasks
Debugging a Failed Deployment
- Check Pod Viewer for pod status
- View Events tab on pod details
- Check Logs for error messages
- Review Metrics for resource limits hit
Investigating Slow Performance
- Check Grafana dashboards for resource saturation
- View Logs for slow query warnings
- Check Pod Viewer for restart counts
- Review metrics for I/O bottlenecks
Monitoring Service Health
- View service in Services section
- Check pod count and status
- Review Metrics for error rates
- Set up Alerts for critical thresholds
Capacity Planning
- Review Grafana cluster overview
- Check node resource trends over time
- Identify usage patterns
- Plan node additions before capacity is reached
Integration Between Tools
All observability tools work together:
From Pod Viewer:
- See resource metrics inline for each pod
- Identify problematic pods by status
- Jump to service logs tab or Grafana for detailed analysis
From Service Logs:
- View logs from all pods for a service
- Correlate log events with pod restarts
- Track issues across replica pods
From Vector (Advanced Logs):
- Search across all services and pods
- Correlate logs with metrics and events
- Track requests across the entire cluster
From Grafana:
- View metrics alongside service logs
- Set up alerts based on log patterns and metrics
- Create comprehensive dashboards combining logs and metrics
Best Practices
For Developers:
- Implement health check endpoints
- Expose custom metrics (Prometheus format)
- Use structured logging
- Include correlation IDs in logs
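A health check endpoint usually aggregates a few dependency probes into one status. The handler sketch below is framework-agnostic; the dependency names and probe callables are hypothetical stand-ins for real checks (database ping, cache ping):

```python
import json

# Aggregate dependency probes into an HTTP-style (status, body) response.
def health(checks):
    """checks: mapping of dependency name -> callable returning True/False."""
    results = {name: bool(check()) for name, check in checks.items()}
    ok = all(results.values())
    status = 200 if ok else 503
    body = json.dumps({"status": "ok" if ok else "degraded", "checks": results})
    return status, body

status, body = health({"database": lambda: True, "cache": lambda: True})
print(status, body)
# 200 {"status": "ok", "checks": {"database": true, "cache": true}}
```

Returning 503 when any dependency fails lets Kubernetes readiness probes pull the pod out of rotation automatically.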
For Operations:
- Review metrics daily for trends
- Set up alerts for critical services
- Monitor log volume for anomalies
- Regular dashboard reviews with team
For Capacity Planning:
- Track resource usage trends
- Plan for peak loads
- Monitor growth patterns
- Add capacity proactively
Quick Reference
Pod Viewer kubectl equivalents:
# List all pods
kubectl get pods -A
# View pod details
kubectl describe pod <pod-name> -n <OSID>
# Check resource usage
kubectl top pod <pod-name> -n <OSID>
# View events
kubectl get events -n <OSID> --sort-by='.lastTimestamp'
Common PromQL queries:
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="your-osid"}[5m])
# Memory usage
container_memory_usage_bytes{namespace="your-osid"}
# Pod count
count(kube_pod_info{namespace="your-osid"})
# Request rate
rate(http_requests_total[5m])
Troubleshooting Observability
Metrics not showing:
- Verify Prometheus is running: kubectl get pods -n monitoring
- Check service monitors are configured
- Ensure pods expose metrics endpoints
Logs not appearing:
- Check pod is running
- Verify Vector is deployed (for advanced logs)
- Check log volume mounts have space
Grafana dashboards empty:
- Verify Prometheus data source configured
- Check time range selected in dashboard
- Ensure metrics are being scraped
With these observability tools, you have complete visibility into your cluster's health, performance, and behavior.