MONITORING-GRAFANA
AIStore Grafana Integration
This document explains how to integrate AIStore with Grafana for visualization and monitoring of AIStore metrics. Grafana provides powerful visualization capabilities for Prometheus metrics collected from AIStore nodes.
Table of Contents
- Architecture
- Prerequisites
- Deployment Options
- Grafana Setup
- Available Dashboards
- Creating Custom Dashboards
- Alert Configuration
- Production Best Practices
- Troubleshooting
- Further Resources
Architecture
The integration architecture follows a standard monitoring pattern:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ │ │ │ │ │
│ AIStore │scrape│ Prometheus │query │ Grafana │
│ Nodes │─────▶│ Server │─────▶│ Dashboards │
│ │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
- AIStore nodes expose metrics endpoints (via the
/metrics
HTTP endpoint) - Prometheus scrapes metrics at regular intervals
- Grafana queries Prometheus and visualizes the data through customizable dashboards
Prerequisites
- AIStore cluster with Prometheus metrics enabled (default in current versions)
- Prometheus server configured to scrape AIStore metrics
- Grafana server (v9.0+ recommended)
Deployment Options
Kubernetes Deployment
For production deployments, AIStore is typically deployed on Kubernetes using the ais-k8s operator.
The github.com/NVIDIA/ais-k8s repository includes monitoring components that set up Prometheus and Grafana with preconfigured dashboards in the monitoring directory.
To deploy AIStore with monitoring on Kubernetes:
- Clone the ais-k8s repository:
git clone https://github.com/NVIDIA/ais-k8s.git cd ais-k8s
- Deploy AIStore with the operator, which includes monitoring stack:
# Follow the operator deployment instructions in ais-k8s documentation
- The monitoring stack includes:
- Prometheus for metrics collection
- Grafana for visualization
- AlertManager for alert routing
- Preconfigured dashboards for AIStore
Standalone Deployment
For non-Kubernetes deployments or development environments, you can set up Grafana separately.
Grafana Installation
Docker Installation
docker run -d -p 3000:3000 --name grafana grafana/grafana-oss
Standard Installation
# Debian/Ubuntu
sudo apt-get install -y apt-transport-https software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install grafana
# Start Grafana
sudo systemctl start grafana-server
For other platforms, see the Grafana installation documentation.
Grafana Setup
Adding Prometheus Data Source
- Log in to Grafana (default: http://localhost:3000, admin/admin)
- Go to Configuration > Data Sources > Add data source
- Select Prometheus
- Set the URL to your Prometheus server (e.g., http://prometheus-server:9090)
- Click “Save & Test” to verify the connection
Importing AIStore Dashboards
AIStore provides pre-built Grafana dashboards for monitoring:
Method 1: Import from JSON
- Download the AIStore dashboard JSON file:
- In Grafana, go to Dashboards > Import
- Upload the JSON file or paste its contents
- Select your Prometheus data source
- Click Import
Method 2: Manual Dashboard Creation
If you prefer to build dashboards from scratch:
- In Grafana, create a new dashboard (+ > Create > Dashboard)
- Add panels with AIStore-specific metrics using the PromQL queries provided in the Example PromQL Queries section
- Organize panels into logical sections (cluster overview, node details, operations, etc.)
- Save the dashboard
Available Dashboards
AIStore Cluster Dashboard
The main AIStore dashboard provides:
- Cluster health overview
- Storage capacity and usage
- Throughput and latency metrics
- Operation rates by type
- Error rates
- Resource usage (CPU, memory)
Key panels include:
- Cluster Health: Node status, rebalance status
- Operations: GET/PUT/DELETE throughput, latency statistics
- Storage: Capacity, utilization, growth trends
- Resources: CPU, memory, network usage by node
- Errors: Error rates, error distribution by type
AIStore Kubernetes Dashboard
For Kubernetes deployments, the specialized dashboard includes:
- Pod status and health
- Node resource utilization by pod
- Storage performance by pod
- Network metrics
- Rebalance monitoring
- Kubernetes-specific resource allocation
Creating Custom Dashboards
Key Metrics to Visualize
When creating custom dashboards, consider including:
- Cluster Health
- Number of online nodes
- Storage capacity and usage
- Error rates
- Node states and alerts
- Performance
- Throughput (GET/PUT/DELETE)
- Operation latency
- Request rates
- Cache hit ratios
- Resources
- CPU utilization
- Memory usage
- Disk I/O
- Network traffic
Example PromQL Queries
Throughput Panels
# GET throughput (bytes/sec)
sum(rate(ais_target_get_bytes[5m]))
# PUT throughput (bytes/sec)
sum(rate(ais_target_put_bytes[5m]))
# DELETE operations/sec
sum(rate(ais_target_delete_count[5m]))
Latency Panels
# GET latency (milliseconds) using PromQL
sum(rate(ais_target_get_ns_total[5m])) / sum(rate(ais_target_get_count[5m])) / 1000000
# PUT latency (milliseconds) using PromQL
sum(rate(ais_target_put_ns_total[5m])) / sum(rate(ais_target_put_count[5m])) / 1000000
Storage Usage Panel
# Cluster storage utilization percentage
100 * sum(ais_target_capacity_used) / sum(ais_target_capacity_total)
# Storage used by node
ais_target_capacity_used{node_id="$node"}
Node State Panels
# Node state flags (alert conditions)
ais_target_state_flags{node_id="$node"}
# Nodes with red alerts (OOS - out of space)
ais_target_state_flags > 0 and ais_target_state_flags & 8192 > 0
Variable Templates
Create dashboard variables to make your dashboard more interactive:
- Dashboard Settings > Variables > New
- Create a variable for node selection:
- Name:
node
- Type: Query
- Data source: Prometheus
- Query:
label_values(ais_target_uptime, node_id)
- Name:
Use the variable in your queries: {node_id="$node"}
Alert Configuration
Grafana can be configured to send alerts based on metric thresholds:
Node State Alerts
Set up alerts based on the AIStore node state flags (refer to Node Alerts in AIStore Prometheus docs for all available states):
- Create an alert for red alert conditions:
- Condition:
ais_target_state_flags{node_id=~"$node"} > 0 and (ais_target_state_flags{node_id=~"$node"} & 8192 > 0 or ais_target_state_flags{node_id=~"$node"} & 16384 > 0)
- Description: “Critical node state alert detected”
- Severity: Critical
- Condition:
- Create an alert for warning conditions:
- Condition:
ais_target_state_flags{node_id=~"$node"} > 0 and (ais_target_state_flags{node_id=~"$node"} & 128 > 0 or ais_target_state_flags{node_id=~"$node"} & 4096 > 0)
- Description: “Warning node state alert detected”
- Severity: Warning
- Condition:
Performance Alerts
- High latency alert:
- Condition:
sum(rate(ais_target_get_ns_total[5m])) / sum(rate(ais_target_get_count[5m])) / 1000000 > 200
- Description: “GET operation latency exceeds 200ms”
- Condition:
- Error rate alert:
- Condition:
sum(rate(ais_target_err_get_count[5m])) / sum(rate(ais_target_get_count[5m])) * 100 > 1
- Description: “GET error rate exceeds 1%”
- Condition:
Resource Alerts
Common alert thresholds:
Metric | Warning | Critical | Description |
---|---|---|---|
Disk Usage | 85% | 95% | Storage capacity utilization |
Error Rate | 1% | 5% | Operation error percentage |
Node Count | n-1 | n-2 | Number of online nodes vs. expected |
CPU Usage | 70% | 90% | CPU utilization |
Memory Usage | 80% | 95% | Memory utilization |
Latency | 2x baseline | 5x baseline | Operation latency increase |
Production Best Practices
- Retention and Sampling
- Configure appropriate retention periods in Prometheus
- Use recording rules for complex queries
- Consider downsampling for long-term storage
- Dashboard Organization
- Group related metrics on the same dashboard
- Use row dividers to organize panels
- Add documentation links and descriptions
- Performance Considerations
- Limit the number of panels per dashboard
- Use appropriate time ranges
- Avoid overly complex queries that could impact Prometheus
- High Availability
- Deploy Prometheus with high availability for production
- Consider using Grafana Enterprise for critical environments
- Set up redundant alerting paths
- Security
- Configure proper authentication
- Set up appropriate user roles
- Share dashboards with read-only permissions
- Use TLS for all communications
Troubleshooting
No Data in Grafana
- Verify Prometheus data source connection:
- Check Prometheus server health
- Ensure connectivity between Grafana and Prometheus
- Test connection in Data Sources section
- Check metrics collection:
- Verify AIStore nodes are exposing metrics
- Test direct access to metrics endpoint:
curl http://<aistore-node>:<port>/metrics
- Check Prometheus targets status
- Validate PromQL queries:
- Test queries directly in Prometheus UI
- Check for typos in metric names
- Verify label selectors match your deployment
Incomplete Metrics
- Ensure all AIStore nodes are being scraped:
- Check Prometheus targets
- Verify scrape configurations
- Check for connectivity issues
- Check for missing or incomplete metrics:
- Review AIStore logs for metrics-related messages
- Verify AIStore version compatibility with dashboards
- Check for misconfiguration in Prometheus scrape settings
Dashboard Performance Issues
- Optimize queries:
- Simplify complex queries
- Add appropriate time ranges
- Use recording rules for frequently used queries
- Reduce load:
- Decrease dashboard refresh rate
- Limit the number of panels per dashboard
- Consider splitting complex dashboards