Observability

RunOS provides comprehensive observability for your cluster through three main tools: the Pod Viewer for real-time pod monitoring, the Log Viewer for application logs, and an integrated metrics suite (Prometheus and Grafana) for detailed performance insights.

Pod Viewer

The Pod Viewer gives you real-time visibility into all pods running in your cluster.

Accessing the Pod Viewer

Via Console:

  1. Click Pods in the main navigation
  2. View all pods across your cluster
  3. Filter by namespace (OSID), status, or node

From Service/Application Pages:

  1. Navigate to a specific service or application
  2. Click the Pods tab
  3. View pods for that service/application only

Understanding Pod Status

Healthy States:

  • 🟢 Running - Pod operational and serving traffic
  • ⚪ Completed - Job finished successfully

Transitional States:

  • 🟡 Pending - Starting up, normal during deployment
  • 🟡 ContainerCreating - Pulling images or creating containers

Problem States:

  • 🟠 CrashLoopBackOff - Container repeatedly crashing
  • 🔴 Failed - Pod could not start
  • 🔴 Error - Configuration or system issue
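
If you prefer the command line, the same states can be checked with kubectl. Note that status.phase covers only the broad pod phases (Pending, Running, Succeeded, Failed); CrashLoopBackOff is a container state that appears in the STATUS column of kubectl get pods:

# Pods that are not Running or Completed, cluster-wide
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# CrashLoopBackOff shows in the STATUS column rather than the phase
kubectl get pods -A | grep CrashLoopBackOff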

Quick Actions

Restart Pod:

  • Click the Restart button
  • Kubernetes automatically recreates the pod
  • Useful for recovering from errors
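
There is no dedicated "restart" verb in kubectl; the rough equivalent of the Restart button (assuming the pod is managed by a Deployment or StatefulSet) is to delete the pod or roll the workload:

# Delete the pod; its controller recreates it automatically
kubectl delete pod <pod-name> -n <OSID>

# Or perform a rolling restart of the whole workload
kubectl rollout restart deployment/<deployment-name> -n <OSID>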

Note on Logs: The Pod Viewer shows pods and their statuses only; it does not display logs. To view logs:

  • Navigate to the service management screen and click the Logs tab
  • Or install Vector from the Services Marketplace for comprehensive historical log viewing (see below)

Filter by:

  • Namespace (OSID) - Show pods for specific service/app
  • Status - Show only running, failed, or pending pods
  • Node - See all pods on a particular server
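
These console filters correspond to standard kubectl selectors, for example:

# Pods for a specific service or application (namespace = OSID)
kubectl get pods -n <OSID>

# Only failed pods
kubectl get pods -n <OSID> --field-selector=status.phase=Failed

# All pods scheduled on a particular node
kubectl get pods -A --field-selector=spec.nodeName=<node-name>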

Search:

  • Search by pod name, namespace, or node
  • Partial matching supported
  • Real-time filtering as you type

Log Viewer

Access logs from all your services and applications through basic streaming or advanced log aggregation.

Basic Logs (Always Available)

View real-time logs from individual services without additional setup.

Accessing Basic Logs:

  1. Navigate to Services in the main navigation
  2. Click on the service you want to view logs for
  3. Click the Logs tab
  4. View streaming logs in real-time

Or via kubectl:

# View logs from specific pod
kubectl logs <pod-name> -n <OSID>

# Stream logs in real-time
kubectl logs <pod-name> -n <OSID> -f

# Last 100 lines
kubectl logs <pod-name> -n <OSID> --tail=100

# Previous container (if crashed)
kubectl logs <pod-name> -n <OSID> --previous
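
To stream logs from several pods of the same workload at once, kubectl also accepts a label selector; the label key and value below are only an example and depend on how your workload is labelled:

# Stream logs from every pod matching a label
kubectl logs -l app=<app-label> -n <OSID> -f --all-containers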

Basic Log Features:

  • Real-time streaming with auto-scroll
  • Search within visible content (Ctrl+F/Cmd+F)
  • Download logs to file
  • Tail control (last 50, 100, 500, 1000 lines)
  • Timestamp display

Limitations:

  • One pod at a time
  • Limited retention (rotated based on disk space)
  • No historical search
  • No cross-pod correlation

Advanced Log Aggregation (Vector)

For centralized logs with full-text search across all pods, deploy Vector from the Services Marketplace. Vector uses ClickHouse as its storage backend, providing high-performance log storage and analytics capabilities.

What Vector Provides:

  • Centralized log collection from all pods across the cluster
  • Full-text search across entire cluster using ClickHouse
  • Historical log storage with configurable retention
  • Multi-pod log correlation
  • Export capabilities for compliance and analysis

Installing Vector:

  1. Navigate to Services → Marketplace
  2. Search for "Vector"
  3. Click Deploy Vector
  4. Configure:
    • Log retention (default: 7 days)
    • Storage size (default: 10Gi)
    • Storage class (OpenEBS or Longhorn)
  5. Click Deploy

Vector automatically deploys with ClickHouse as its database backend for storing and querying logs. After installation, a Logs menu item appears in the main navigation with full search capabilities.

For more details: See Vector Pipelines and ClickHouse in the Service Catalog.
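
To confirm the deployment from the command line, check the pods in the namespace (OSID) the marketplace assigned to Vector; the placeholder below stands for whatever that OSID is:

# Vector and its ClickHouse backend should both show Running
kubectl get pods -n <vector-OSID>

# Inspect a pod that is not starting
kubectl describe pod <pod-name> -n <vector-OSID>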

Advanced Log Features:

Full-Text Search:

# Simple search
error

# Multiple terms (AND)
error database connection

# Phrase search
"connection refused"

# Field-specific search
namespace:mysql-prod AND level:error

# Time-based search
@timestamp:[2024-03-01 TO 2024-03-02] AND error

Filtering:

  • Filter by namespace (OSID)
  • Filter by pod name
  • Filter by container
  • Filter by log level (DEBUG, INFO, WARN, ERROR, FATAL)
  • Filter by time range (last 5m, 1h, 24h, 7d, custom)

Log Correlation:

  • Track requests across microservices by request ID
  • View all logs in a specific time window
  • Find all instances of specific error types
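
Using the same field:value search syntax shown above, a request can be followed across services with a query like the one below (the request_id field name is illustrative; use whatever field your applications actually log):

request_id:"abc-123" AND @timestamp:[2024-03-01 TO 2024-03-02]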

Export:

  • Export search results to CSV, JSON, or plain text
  • Bulk export for compliance or analysis

Log Best Practices

Effective Logging in Applications:

  • Use structured logging (JSON format preferred)
  • Include request ID, user ID, and correlation IDs
  • Use appropriate log levels (DEBUG, INFO, WARN, ERROR, FATAL)
  • Never log passwords, secrets, or PII
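
As an illustration, a structured JSON log line following these guidelines might look like this (field names are only an example):

{"timestamp":"2024-03-01T12:34:56Z","level":"ERROR","message":"connection refused","request_id":"abc-123","user_id":"u-42","service":"checkout"}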

Log Retention:

  • Development: 1-3 days
  • Staging: 7-14 days
  • Production: 30-90 days (or more for compliance)

Avoid:

  • Logging large payloads (truncate or summarize)
  • Excessive DEBUG logging in production
  • Logging in tight loops (causes storage issues)

Metrics and Monitoring

RunOS includes an integrated metrics suite using Prometheus for time-series data and Grafana for visualization.

What's Included

The metrics suite provides comprehensive insights into:

Service Metrics:

  • Resource usage (CPU, memory, disk I/O)
  • Request rates and latencies
  • Error rates
  • Connection counts
  • Query performance (for databases)

Node Metrics:

  • CPU utilization per node
  • Memory usage and available capacity
  • Disk usage and I/O statistics
  • Network traffic
  • System load

Application Metrics:

  • Custom application metrics
  • HTTP request metrics
  • Response times
  • Business metrics (if exposed)

Metrics Architecture

Prometheus:

  • Time-series database for metrics
  • Automatically scrapes metrics from all pods
  • Stores metrics with configurable retention
  • Provides query language (PromQL)

Grafana:

  • Visualization and dashboarding
  • Pre-configured dashboards for all services
  • Customizable dashboards
  • Alerting capabilities

Accessing Metrics

Via Console:

  • Key metrics displayed on service/application detail pages
  • Pod Viewer shows CPU and memory usage
  • Dashboard shows cluster-wide metrics

Via Grafana:

  1. Navigate to Services in main menu
  2. Find Grafana in your deployed services
  3. Click to open Grafana interface
  4. Browse pre-configured dashboards

Pre-configured Dashboards:

  • Cluster Overview - Node health and resource usage
  • Kubernetes Pods - Pod metrics across cluster
  • Node Exporter - Detailed node-level metrics
  • Service-Specific - Dashboards for PostgreSQL, MySQL, Redis, etc.

Understanding Metrics

CPU Usage:

  • Measured in cores (e.g., 0.5 cores = 50% of one CPU)
  • Shows current usage vs. allocated resources
  • Warning if approaching limits

Memory Usage:

  • Measured in bytes (MB/GB)
  • Shows current usage vs. allocated limits
  • Critical if a pod exceeds its memory limit (the container may be OOM-killed)

Network I/O:

  • Bytes sent/received per second
  • Connection counts
  • Helps identify traffic patterns

Disk I/O:

  • Read/write operations per second
  • Latency metrics
  • Important for database performance
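
To compare current usage against allocated limits, you can divide the usage metrics by the limit metrics; this is a sketch that assumes the standard cAdvisor and kube-state-metrics exporters are present:

# CPU usage as a fraction of the CPU limit, per pod
sum(rate(container_cpu_usage_seconds_total{namespace="your-osid"}[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu", namespace="your-osid"}) by (pod)

# Memory usage as a fraction of the memory limit, per pod
sum(container_memory_usage_bytes{namespace="your-osid"}) by (pod)
  / sum(kube_pod_container_resource_limits{resource="memory", namespace="your-osid"}) by (pod)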

Custom Dashboards

Create custom Grafana dashboards for your specific needs:

  1. Open Grafana from Services
  2. Click + → Dashboard
  3. Add panels with PromQL queries
  4. Save dashboard for team access

Example PromQL Queries:

# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by namespace
container_memory_usage_bytes{namespace="your-osid"}

# HTTP request rate
rate(http_requests_total[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Metrics Retention

Default retention:

  • Prometheus: 15 days
  • Grafana: Configurable per dashboard

Storage requirements:

  • Small cluster: 10-20Gi
  • Medium cluster: 50-100Gi
  • Large cluster: 200-500Gi

Alerting

Set up alerts based on metrics thresholds (requires Grafana configuration):

Common alerts:

  • High CPU usage (above 80% for 5 minutes)
  • High memory usage (above 90%)
  • Pod restart rate (more than 3 in 10 minutes)
  • Service down (no metrics received)
  • Disk space low (less than 10% free)
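
The thresholds above translate into PromQL expressions that can be used as alert conditions; the examples below are a sketch (metric names assume the standard exporters) rather than ready-made rules:

# High CPU: pod using more than 80% of its CPU limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) > 0.8

# Pod restart rate: more than 3 restarts in 10 minutes
increase(kube_pod_container_status_restarts_total[10m]) > 3

# Service down: scrape target not responding
up == 0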

Resource Usage at a Glance

Quickly assess resource usage across your cluster:

Node-Level:

  • View all nodes and their CPU/memory utilization
  • Identify overloaded nodes
  • Plan capacity expansion

Service-Level:

  • See resource usage per service
  • Identify resource-hungry services
  • Optimize resource allocations

Pod-Level:

  • Individual pod metrics in Pod Viewer
  • Compare pods to identify issues
  • Troubleshoot performance problems

Common Observability Tasks

Debugging a Failed Deployment

  1. Check Pod Viewer for pod status
  2. View Events tab on pod details
  3. Check Logs for error messages
  4. Review Metrics for resource limits hit
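
The same steps from the command line, assuming you already know the pod name and OSID:

# 1. Check status and recent restarts
kubectl get pods -n <OSID>

# 2. Read the events recorded for the pod
kubectl describe pod <pod-name> -n <OSID>

# 3. Inspect logs, including the previous container if it crashed
kubectl logs <pod-name> -n <OSID> --previous

# 4. Look for resource-limit kills (e.g. OOMKilled) in the pod's last state
kubectl describe pod <pod-name> -n <OSID> | grep -A 3 "Last State"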

Investigating Slow Performance

  1. Check Grafana dashboards for resource saturation
  2. View Logs for slow query warnings
  3. Check Pod Viewer for restart counts
  4. Review metrics for I/O bottlenecks

Monitoring Service Health

  1. View service in Services section
  2. Check pod count and status
  3. Review Metrics for error rates
  4. Set up Alerts for critical thresholds

Capacity Planning

  1. Review Grafana cluster overview
  2. Check node resource trends over time
  3. Identify usage patterns
  4. Plan node additions before capacity is reached
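
For step 2, node-level trends can be pulled straight from the node exporter metrics, for example:

# Average CPU utilization per node over the selected range
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Fraction of memory still available per node
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes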

Integration Between Tools

All observability tools work together:

From Pod Viewer:

  • See resource metrics inline for each pod
  • Identify problematic pods by status
  • Jump to service logs tab or Grafana for detailed analysis

From Service Logs:

  • View logs from all pods for a service
  • Correlate log events with pod restarts
  • Track issues across replica pods

From Vector (Advanced Logs):

  • Search across all services and pods
  • Correlate logs with metrics and events
  • Track requests across the entire cluster

From Grafana:

  • View metrics alongside service logs
  • Set up alerts based on log patterns and metrics
  • Create comprehensive dashboards combining logs and metrics

Best Practices

For Developers:

  • Implement health check endpoints
  • Expose custom metrics (Prometheus format)
  • Use structured logging
  • Include correlation IDs in logs

For Operations:

  • Review metrics daily for trends
  • Set up alerts for critical services
  • Monitor log volume for anomalies
  • Regular dashboard reviews with team

For Capacity Planning:

  • Track resource usage trends
  • Plan for peak loads
  • Monitor growth patterns
  • Add capacity proactively

Quick Reference

Pod Viewer kubectl equivalents:

# List all pods
kubectl get pods -A

# View pod details
kubectl describe pod <pod-name> -n <OSID>

# Check resource usage
kubectl top pod <pod-name> -n <OSID>

# View events
kubectl get events -n <OSID> --sort-by='.lastTimestamp'

Common PromQL queries:

# CPU usage
rate(container_cpu_usage_seconds_total{namespace="your-osid"}[5m])

# Memory usage
container_memory_usage_bytes{namespace="your-osid"}

# Pod count
count(kube_pod_info{namespace="your-osid"})

# Request rate
rate(http_requests_total[5m])

Troubleshooting Observability

Metrics not showing:

  • Verify Prometheus is running: kubectl get pods -n monitoring
  • Check service monitors are configured
  • Ensure pods expose metrics endpoints
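
A couple of quick checks from the command line (the ServiceMonitor resource exists only when the Prometheus Operator is in use):

# List configured ServiceMonitors
kubectl get servicemonitors -A

# Describe one to see which service labels and ports it scrapes
kubectl describe servicemonitor <name> -n monitoring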

Logs not appearing:

  • Check pod is running
  • Verify Vector is deployed (for advanced logs)
  • Check log volume mounts have space

Grafana dashboards empty:

  • Verify Prometheus data source configured
  • Check time range selected in dashboard
  • Ensure metrics are being scraped

With these observability tools, you have complete visibility into your cluster's health, performance, and behavior.