Observability
RunOS provides comprehensive observability for your cluster through three main tools: the Pod Viewer for real-time pod monitoring, Log Viewer for application logs, and an integrated metrics suite using Prometheus and Grafana for detailed performance insights.
Pod Viewer
The Pod Viewer gives you real-time visibility into all pods running in your cluster.
Accessing the Pod Viewer
Via Console:
- Click Pods in the main navigation
- View all pods across your cluster
- Filter by namespace (OSID), status, or node
From Service/Application Pages:
- Navigate to a specific service or application
- Click the Pods tab
- View pods for that service/application only
Understanding Pod Status
Healthy States:
- 🟢 Running - Pod operational and serving traffic
- ⚪ Completed - Job finished successfully
Transitional States:
- 🟡 Pending - Starting up, normal during deployment
- 🟡 ContainerCreating - Pulling images or creating containers
Problem States:
- 🔄 CrashLoopBackOff - Container repeatedly crashing
- 🔴 Failed - Pod could not start
- 🔴 Error - Configuration or system issue
Quick Actions
Restart Pod:
- Click the Restart button
- Kubernetes automatically recreates the pod
- Useful for recovering from errors
Note on Logs: The Pod Viewer shows pods and their statuses only; it does not display logs. To view logs:
- Navigate to the service management screen and click Logs tab
- Or install Vector from the Services marketplace for comprehensive historical log viewing (see below)
Filtering and Search
Filter by:
- Namespace (OSID) - Show pods for specific service/app
- Status - Show only running, failed, or pending pods
- Node - See all pods on a particular server
Search:
- Search by pod name, namespace, or node
- Partial matching supported
- Real-time filtering as you type
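The filter-and-search behaviour described above can be sketched in a few lines of Python. This is an illustrative model with made-up pod records, not the actual RunOS implementation:

```python
# Minimal sketch of the Pod Viewer's partial-match filter.
# The pod records below are hypothetical examples.
pods = [
    {"name": "mysql-prod-0", "namespace": "mysql-prod", "node": "node-1", "status": "Running"},
    {"name": "redis-cache-0", "namespace": "redis-cache", "node": "node-2", "status": "Pending"},
    {"name": "mysql-prod-1", "namespace": "mysql-prod", "node": "node-2", "status": "CrashLoopBackOff"},
]

def filter_pods(pods, query="", status=None):
    """Return pods whose name, namespace, or node contains `query`
    (case-insensitive partial match), optionally restricted to one status."""
    q = query.lower()
    return [
        p for p in pods
        if (q in p["name"].lower() or q in p["namespace"].lower() or q in p["node"].lower())
        and (status is None or p["status"] == status)
    ]

print([p["name"] for p in filter_pods(pods, "mysql")])
# ['mysql-prod-0', 'mysql-prod-1']
```

Combining a search term with a status filter narrows results the same way the console's filter bar does.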
Log Viewer
Access logs from all your services and applications through basic streaming or advanced log aggregation.
Basic Logs (Always Available)
View real-time logs from individual services without additional setup.
Accessing Basic Logs:
- Navigate to Services in the main navigation
- Click on the service you want to view logs for
- Click the Logs tab
- View streaming logs in real-time
Or via kubectl:
# View logs from specific pod
kubectl logs <pod-name> -n <OSID>
# Stream logs in real-time
kubectl logs <pod-name> -n <OSID> -f
# Last 100 lines
kubectl logs <pod-name> -n <OSID> --tail=100
# Previous container (if crashed)
kubectl logs <pod-name> -n <OSID> --previous
Basic Log Features:
- Real-time streaming with auto-scroll
- Search within visible content (Ctrl+F/Cmd+F)
- Download logs to file
- Tail control (last 50, 100, 500, 1000 lines)
- Timestamp display
Limitations:
- One pod at a time
- Limited retention (rotated based on disk space)
- No historical search
- No cross-pod correlation
Advanced Log Aggregation (Vector)
For centralized logs with full-text search across all pods, deploy Vector from the Services Marketplace. Vector uses ClickHouse as its storage backend, providing high-performance log storage and analytics capabilities.
What Vector Provides:
- Centralized log collection from all pods across the cluster
- Full-text search across entire cluster using ClickHouse
- Historical log storage with configurable retention
- Multi-pod log correlation
- Export capabilities for compliance and analysis
Installing Vector:
- Navigate to Services → Marketplace
- Search for "Vector"
- Click Deploy Vector
- Configure:
- Log retention (default: 7 days)
- Storage size (default: 10Gi)
- Storage class (OpenEBS or Longhorn)
- Click Deploy
Vector automatically deploys with ClickHouse as its database backend for storing and querying logs. After installation, a Logs menu item appears in the main navigation with full search capabilities.
For more details: See Vector Pipelines and ClickHouse in the Service Catalog.
Advanced Log Features:
Full-Text Search:
# Simple search
error
# Multiple terms (AND)
error database connection
# Phrase search
"connection refused"
# Field-specific search
namespace:mysql-prod AND level:error
# Time-based search
@timestamp:[2024-03-01 TO 2024-03-02] AND error
Filtering:
- Filter by namespace (OSID)
- Filter by pod name
- Filter by container
- Filter by log level (DEBUG, INFO, WARN, ERROR, FATAL)
- Filter by time range (last 5m, 1h, 24h, 7d, custom)
Log Correlation:
- Track requests across microservices by request ID
- View all logs in a specific time window
- Find all instances of specific error types
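The request-ID correlation above can be sketched in plain Python. The JSON log lines and their field names (`request_id`, `pod`, `level`) are illustrative, not a fixed RunOS schema:

```python
import json
from collections import defaultdict

# Hypothetical structured log lines, as a log aggregator might store them.
raw_logs = [
    '{"request_id": "abc123", "pod": "api-0", "level": "INFO", "msg": "request received"}',
    '{"request_id": "abc123", "pod": "worker-1", "level": "ERROR", "msg": "db timeout"}',
    '{"request_id": "def456", "pod": "api-1", "level": "INFO", "msg": "request received"}',
]

def correlate(lines):
    """Group parsed log entries by request_id so one request can be
    followed across every pod it touched."""
    grouped = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        grouped[entry["request_id"]].append(entry)
    return grouped

trace = correlate(raw_logs)["abc123"]
print([e["pod"] for e in trace])  # ['api-0', 'worker-1']
```

This is exactly why structured logs with a shared correlation ID matter: without the `request_id` field, the two halves of the failing request would be unrelated lines in two different pods.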
Export:
- Export search results to CSV, JSON, or plain text
- Bulk export for compliance or analysis
Log Best Practices
Effective Logging in Applications:
- Use structured logging (JSON format preferred)
- Include request ID, user ID, and correlation IDs
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR, FATAL)
- Never log passwords, secrets, or PII
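A minimal structured-logging setup using only the Python standard library might look like the sketch below; the field names are illustrative:

```python
import io
import json
import logging

# A JSON formatter so each log line is machine-parseable by an aggregator.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

stream = io.StringIO()  # stand-in for stdout, which Kubernetes captures
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("payment processed", extra={"request_id": "abc123"})
print(stream.getvalue().strip())
# {"level": "INFO", "msg": "payment processed", "request_id": "abc123"}
```

Writing one JSON object per line to stdout is enough: Kubernetes captures it, and an aggregator like Vector can index every field without extra parsing rules.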
Log Retention:
- Development: 1-3 days
- Staging: 7-14 days
- Production: 30-90 days (or more for compliance)
Avoid:
- Logging large payloads (truncate or summarize)
- Excessive DEBUG logging in production
- Logging in tight loops (causes storage issues)
Metrics and Monitoring
RunOS includes an integrated metrics suite using Prometheus for time-series data and Grafana for visualization.
What's Included
The metrics suite provides comprehensive insights into:
Service Metrics:
- Resource usage (CPU, memory, disk I/O)
- Request rates and latencies
- Error rates
- Connection counts
- Query performance (for databases)
Node Metrics:
- CPU utilization per node
- Memory usage and available capacity
- Disk usage and I/O statistics
- Network traffic
- System load
Application Metrics:
- Custom application metrics
- HTTP request metrics
- Response times
- Business metrics (if exposed)
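For Prometheus to scrape custom application metrics, the application serves them in the Prometheus text exposition format. The sketch below renders that format by hand to show its shape; in practice you would use a Prometheus client library, and the metric names here are examples:

```python
# Sketch of the Prometheus text exposition format an application serves
# at its /metrics endpoint. Metric names and values are illustrative.
def render_metrics(http_requests_total, memory_bytes):
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
        f'http_requests_total{{method="GET"}} {http_requests_total}',
        "# HELP process_resident_memory_bytes Resident memory size.",
        "# TYPE process_resident_memory_bytes gauge",
        f"process_resident_memory_bytes {memory_bytes}",
    ]
    return "\n".join(lines) + "\n"

print(render_metrics(1027, 52428800))
```

Each metric gets a `# HELP` description, a `# TYPE` (counter, gauge, histogram, summary), and one sample line per label combination.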
Metrics Architecture
Prometheus:
- Time-series database for metrics
- Automatically scrapes metrics from all pods
- Stores metrics with configurable retention
- Provides query language (PromQL)
Grafana:
- Visualization and dashboarding
- Pre-configured dashboards for all services
- Customizable dashboards
- Alerting capabilities
Accessing Metrics
Via Console:
- Key metrics displayed on service/application detail pages
- Pod Viewer shows CPU and memory usage
- Dashboard shows cluster-wide metrics
Via Grafana:
- Navigate to Services in main menu
- Find Grafana in your deployed services
- Click to open Grafana interface
- Browse pre-configured dashboards
Pre-configured Dashboards:
- Cluster Overview - Node health and resource usage
- Kubernetes Pods - Pod metrics across cluster
- Node Exporter - Detailed node-level metrics
- Service-Specific - Dashboards for PostgreSQL, MySQL, Redis, etc.
Understanding Metrics
CPU Usage:
- Measured in cores (e.g., 0.5 cores = 50% of one CPU)
- Shows current usage vs. allocated resources
- Warning if approaching limits
Memory Usage:
- Measured in bytes (MB/GB)
- Shows current usage vs. allocated limits
- Critical if a pod exceeds its memory limit (the container is OOM-killed and restarted)
Network I/O:
- Bytes sent/received per second
- Connection counts
- Helps identify traffic patterns
Disk I/O:
- Read/write operations per second
- Latency metrics
- Important for database performance
Custom Dashboards
Create custom Grafana dashboards for your specific needs:
- Open Grafana from Services
- Click + → Dashboard
- Add panels with PromQL queries
- Save dashboard for team access
Example PromQL Queries:
# CPU usage by pod
rate(container_cpu_usage_seconds_total[5m])
# Memory usage by namespace
container_memory_usage_bytes{namespace="your-osid"}
# HTTP request rate
rate(http_requests_total[5m])
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
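To make the last query above less opaque: Prometheus histograms expose cumulative bucket counts, and `histogram_quantile` linearly interpolates inside the bucket where the target rank falls. A plain-Python sketch of that computation, with made-up bucket data:

```python
# What histogram_quantile(0.95, ...) computes, modelled on cumulative buckets.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 finished under 0.1s, 90 under 0.5s, all 100 under 1s.
print(histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)]))
# prints approximately 0.75
```

The interpolation is why the reported p95 depends on how your histogram buckets are laid out: coarse buckets give coarse quantiles.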
Metrics Retention
Default retention:
- Prometheus: 15 days
- Grafana: Configurable per dashboard
Storage requirements:
- Small cluster: 10-20Gi
- Medium cluster: 50-100Gi
- Large cluster: 200-500Gi
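A back-of-envelope way to sanity-check those sizes is the sizing formula from the Prometheus storage documentation: disk ≈ retention time × ingested samples per second × bytes per sample (roughly 1-2 bytes after compression). The sample rate below is an assumption for illustration:

```python
# Rough Prometheus disk sizing: retention_seconds * samples/s * bytes/sample.
def prometheus_disk_gib(samples_per_second, retention_days, bytes_per_sample=2):
    seconds = retention_days * 24 * 3600
    return samples_per_second * seconds * bytes_per_sample / 2**30

# e.g. 10,000 active series scraped every 15s ~= 667 samples/s,
# kept for the default 15-day retention:
print(round(prometheus_disk_gib(667, 15), 1))  # ~1.6 GiB
```

Real clusters run far more than 10,000 series once node, pod, and service metrics are counted, which is how the 10-500Gi ranges above arise.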
Alerting
Set up alerts based on metrics thresholds (requires Grafana configuration):
Common alerts:
- High CPU usage (above 80% for 5 minutes)
- High memory usage (above 90%)
- Pod restart rate (more than 3 in 10 minutes)
- Service down (no metrics received)
- Disk space low (less than 10% free)
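The "above 80% for 5 minutes" pattern deserves a note: the alert should fire only when the condition holds for the whole window, not on a single spike (this is what a `for: 5m` clause in a Prometheus alerting rule effectively enforces). A minimal sketch of that sustained-threshold check, with hypothetical per-minute CPU readings:

```python
# Fire only when every reading in the window breaches the threshold.
def sustained_breach(samples, threshold=0.80, window=5):
    """samples: one CPU-utilization reading (0.0-1.0) per minute, newest last."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

print(sustained_breach([0.55, 0.82, 0.85, 0.90, 0.88, 0.91]))  # True
print(sustained_breach([0.85, 0.85, 0.60, 0.90, 0.88, 0.91]))  # False: dipped mid-window
```

Requiring the full window suppresses noisy one-off spikes, at the cost of a few minutes of detection latency.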
Resource Usage at a Glance
Quickly assess resource usage across your cluster:
Node-Level:
- View all nodes and their CPU/memory utilization
- Identify overloaded nodes
- Plan capacity expansion
Service-Level:
- See resource usage per service
- Identify resource-hungry services
- Optimize resource allocations
Pod-Level:
- Individual pod metrics in Pod Viewer
- Compare pods to identify issues
- Troubleshoot performance problems
Common Observability Tasks
Debugging a Failed Deployment
- Check Pod Viewer for pod status
- View Events tab on pod details
- Check Logs for error messages
- Review Metrics for resource limits hit
Investigating Slow Performance
- Check Grafana dashboards for resource saturation
- View Logs for slow query warnings
- Check Pod Viewer for restart counts
- Review metrics for I/O bottlenecks
Monitoring Service Health
- View service in Services section
- Check pod count and status
- Review Metrics for error rates
- Set up Alerts for critical thresholds
Capacity Planning
- Review Grafana cluster overview
- Check node resource trends over time
- Identify usage patterns
- Plan node additions before capacity is reached
Integration Between Tools
All observability tools work together:
From Pod Viewer:
- See resource metrics inline for each pod
- Identify problematic pods by status
- Jump to service logs tab or Grafana for detailed analysis
From Service Logs:
- View logs from all pods for a service
- Correlate log events with pod restarts
- Track issues across replica pods
From Vector (Advanced Logs):
- Search across all services and pods
- Correlate logs with metrics and events
- Track requests across the entire cluster
From Grafana:
- View metrics alongside service logs
- Set up alerts based on log patterns and metrics
- Create comprehensive dashboards combining logs and metrics
Best Practices
For Developers:
- Implement health check endpoints
- Expose custom metrics (Prometheus format)
- Use structured logging
- Include correlation IDs in logs
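A health check endpoint usually aggregates a few dependency probes into one status. The handler sketch below is framework-agnostic; the dependency names and probe callables are hypothetical stand-ins for real checks (database ping, cache ping):

```python
import json

# Aggregate dependency probes into an HTTP-style (status, body) response.
def health(checks):
    """checks: mapping of dependency name -> callable returning True/False."""
    results = {name: bool(check()) for name, check in checks.items()}
    ok = all(results.values())
    status = 200 if ok else 503
    body = json.dumps({"status": "ok" if ok else "degraded", "checks": results})
    return status, body

status, body = health({"database": lambda: True, "cache": lambda: True})
print(status, body)
# 200 {"status": "ok", "checks": {"database": true, "cache": true}}
```

Returning 503 when any dependency fails lets Kubernetes readiness probes pull the pod out of rotation automatically.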
For Operations:
- Review metrics daily for trends
- Set up alerts for critical services
- Monitor log volume for anomalies
- Regular dashboard reviews with team
For Capacity Planning:
- Track resource usage trends
- Plan for peak loads
- Monitor growth patterns
- Add capacity proactively
Quick Reference
Pod Viewer kubectl equivalents:
# List all pods
kubectl get pods -A
# View pod details
kubectl describe pod <pod-name> -n <OSID>
# Check resource usage
kubectl top pod <pod-name> -n <OSID>
# View events
kubectl get events -n <OSID> --sort-by='.lastTimestamp'
Common PromQL queries:
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="your-osid"}[5m])
# Memory usage
container_memory_usage_bytes{namespace="your-osid"}
# Pod count
count(kube_pod_info{namespace="your-osid"})
# Request rate
rate(http_requests_total[5m])
Troubleshooting Observability
Metrics not showing:
- Verify Prometheus is running: kubectl get pods -n monitoring
- Check service monitors are configured
- Ensure pods expose metrics endpoints
Logs not appearing:
- Check pod is running
- Verify Vector is deployed (for advanced logs)
- Check log volume mounts have space
Grafana dashboards empty:
- Verify Prometheus data source configured
- Check time range selected in dashboard
- Ensure metrics are being scraped
With these observability tools, you have complete visibility into your cluster's health, performance, and behavior.