AIStore Observability: Prometheus

AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints. This integration enables comprehensive monitoring of AIS clusters, performance tracking, and trend analysis.

Overview
Monitoring Architecture
Prometheus Integration
Node Alerts
StatsD Alternative
- StatsD Exporter for Prometheus
- Setup and Configuration
Best Practices
References
Related Documentation

Overview

AIS tracks a comprehensive set of performance metrics including:

Performance counters
Resource utilization percentages
Latency and throughput metrics
Data transfer statistics (total bytes and object counts)
Error counters and operational status

Full observability is supported using multiple complementary tools:

AIS node logs for detailed diagnostics
CLI for interactive monitoring, specifically the ais show cluster stats command
Monitoring backends:
- Prometheus (recommended)
- StatsD with any compliant backend (e.g., Graphite/Grafana)

For information on load testing metrics, please refer to AIS Load Generator and How To Benchmark AIStore.

Monitoring Architecture

The typical monitoring setup with Prometheus looks as follows:

┌────────────────┐       ┌────────────────┐
│                │ scrape│                │
│   Prometheus   │◄──────┤  AIStore Node  │
│                │       │   /metrics     │
└────────────────┘       └────────────────┘
        │
        │ query
        ▼
┌────────────────┐
│                │
│     Grafana    │
│                │
└────────────────┘

This layout provides:

Direct metric collection from AIS nodes
Centralized metric storage in Prometheus
Powerful visualization through Grafana dashboards
Historical trend analysis and alerting capabilities

Prometheus Integration

Native Exporter

AIS is a fully compliant Prometheus exporter that natively supports metric collection without additional components. Key integration points:

Configuration: No special configuration is required - simply build AIS without the statsd build tag to enable Prometheus support
Metric Registration: When starting, each AIS node (gateway or storage target) automatically:
- Registers all metric descriptions (names, labels, and help text) with Prometheus
- Exposes the HTTP endpoint /metrics for Prometheus scraping
Build Selection: The choice between StatsD and Prometheus is a build-time decision controlled by the statsd build tag

For the complete list of supported build tags, please see conditional linkage.

Viewing Raw Metrics

You can directly view the exposed metrics using curl:

$ curl http://<aistore-node-ip-or-hostname>:<port>/metrics

# For HTTPS deployments:
$ curl https://<aistore-node-ip-or-hostname>:<port>/metrics

Sample output:

# HELP ais_target_disk_avg_rsize average read size (bytes)
# TYPE ais_target_disk_avg_rsize gauge
ais_target_disk_avg_rsize{disk="nvme0n1",node_id="ClCt8081"} 4096
# HELP ais_target_disk_avg_wsize average write size (bytes)
# TYPE ais_target_disk_avg_wsize gauge
ais_target_disk_avg_wsize{disk="nvme0n1",node_id="ClCt8081"} 260130
# HELP ais_target_disk_read_mbps read bandwidth (MB/s)
...
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# HELP ais_target_put_count total number of executed PUT(object) requests
# TYPE ais_target_put_count counter
ais_target_put_count{node_id="ClCt8081"} 1642
# HELP ais_target_put_ns_total PUT: total cumulative time (nanoseconds)
# TYPE ais_target_put_ns_total counter
ais_target_put_ns_total{node_id="ClCt8081"} 9.44367232e+09
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
# HELP ais_target_uptime this node's uptime since its startup (seconds)
# TYPE ais_target_uptime gauge
ais_target_uptime{node_id="ClCt8081"} 210
...

For continuous monitoring of specific metrics without a full Prometheus deployment:

for i in {1..99999}; do
  curl http://hostname:8081/metrics --silent | grep "ais_target_get_n.*node"
  sleep 1
done
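
The same text can also be parsed programmatically. The sketch below is a minimal parser for the plain `name{labels} value` lines shown in the sample output. It skips comment lines and does not handle timestamps or escaped characters inside label values. `parse_metrics` is a hypothetical helper name, not part of AIS or of any Prometheus client library.

```python
import re

# Matches lines such as: ais_target_put_count{node_id="ClCt8081"} 1642
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "# HELP"/"# TYPE" comments, and "..." elisions.
        if not line or line.startswith("#") or line == "...":
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a plain sample line; ignored in this sketch
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# A few lines copied from the sample output above.
SAMPLE_TEXT = """\
# HELP ais_target_put_count total number of executed PUT(object) requests
# TYPE ais_target_put_count counter
ais_target_put_count{node_id="ClCt8081"} 1642
ais_target_disk_avg_rsize{disk="nvme0n1",node_id="ClCt8081"} 4096
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

For anything beyond quick inspection, a maintained Prometheus client library is likely a better choice than a hand-written parser.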

Key Metrics Groups

| Category | Metrics Prefix | Examples | Usage |
|---|---|---|---|
| Operations | `ais_target_get_*`, `ais_target_put_*` | `ais_target_get_count`, `ais_target_put_bytes` | Track throughput, operation counts |
| Resources | `ais_target_disk_*`, `ais_target_mem_*` | `ais_target_disk_util`, `ais_target_mem_used` | Monitor resource consumption |
| Errors | `ais_target_err_*` | `ais_target_err_get_count` | Track operation failures |
| Cloud operations | `ais_target_cloud_*` | `ais_target_cloud_get_count` | Monitor cloud backend activity |
| System | `ais_target_uptime`, `ais_target_rebalance_*` | `ais_target_uptime`, `ais_target_rebalance_objects` | System status, rebalancing |

Metric Labels

AIS exposes labels for detailed filtering and aggregation:

| Label | Description | Type |
|---|---|---|
| `node_id` | Unique node identifier | Static |
| `disk` | Disk identifier for storage metrics | Variable |
| `bucket` | Bucket name | Variable |
| `xaction` | Extended action (batch job) identifier | Variable |

Variable labels enable per-disk, per-bucket, and per-job filtering and aggregation; they are available only in Prometheus mode.
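
To illustrate what aggregation over a label means, the sketch below groups samples by one label and sums their values, which is roughly what PromQL's `sum by (label)` does. The sample values and the node IDs `t1` and `t2` are hypothetical, and `sum_by` is an illustrative helper, not an AIS or Prometheus API.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by one label, similar to sum by (label)."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing label is treated like an empty label value.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-disk samples for two targets.
samples = [
    ({"node_id": "t1", "disk": "nvme0n1"}, 10.0),
    ({"node_id": "t1", "disk": "nvme1n1"}, 20.0),
    ({"node_id": "t2", "disk": "nvme0n1"}, 5.0),
]

print(sum_by(samples, "node_id"))  # per-target totals
print(sum_by(samples, "disk"))     # per-disk totals across targets
```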

Essential Prometheus Queries

Here are key PromQL queries for operational monitoring:

# Cluster-wide GET operations per second (rate over 5m)
sum(rate(ais_target_get_count[5m]))

# Average GET latency in milliseconds
sum(rate(ais_target_get_ns_total[5m])) / sum(rate(ais_target_get_count[5m])) / 1000000

# Utilization of one disk (here nvme0n1) on every target that reports it
ais_target_disk_util{disk="nvme0n1"}

# Error rate as percentage of operations
sum(rate(ais_target_err_get_count[5m])) / sum(rate(ais_target_get_count[5m])) * 100

# Cluster storage capacity utilization
sum(ais_target_capacity_used) / sum(ais_target_capacity_total) * 100

# Node health status (state flags)
ais_target_state_flags
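
The average-latency query divides one rate by another, so the per-second factor cancels and what remains is the change in total nanoseconds divided by the change in request count. The sketch below works through that arithmetic on two counter readings. It ignores the extrapolation that `rate()` applies, and `average_latency_ms` is a hypothetical helper name.

```python
def average_latency_ms(ns_start, ns_end, count_start, count_end):
    """Average latency in milliseconds between two counter readings."""
    delta_ns = ns_end - ns_start
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no requests, or a counter reset, in the window
    return delta_ns / delta_count / 1_000_000  # nanoseconds to milliseconds


# The PUT counters from the sample output, measured from node start (zero):
# ais_target_put_ns_total = 9.44367232e+09, ais_target_put_count = 1642
latency = average_latency_ms(0, 9.44367232e+09, 0, 1642)
print(f"{latency:.2f} ms")
```

With those readings the average comes out at roughly 5.75 ms per PUT.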

Node Alerts

AIStore node states are categorized into three severity levels:

Red Alerts - Critical issues requiring immediate attention:
- OOS - Out of space condition
- OOM - Out of memory condition
- OOCPU - Out of CPU resources
- DiskFault - Disk failures detected
- NoMountpaths - No available mountpaths
- NumGoroutines - Excessive number of goroutines
- CertificateExpired - TLS certificate has expired
- CertificateInvalid - TLS certificate is invalid
Warning Alerts - Potential issues that may require attention:
- Rebalancing - Rebalance operation in progress
- RebalanceInterrupted - Rebalance was interrupted
- Resilvering - Resilvering operation in progress
- ResilverInterrupted - Resilver was interrupted
- NodeRestarted - Node was restarted (powercycle, crash)
- MaintenanceMode - Node is in maintenance mode
- LowCapacity - Low storage capacity (OOS possible soon)
- LowMemory - Low memory condition (OOM possible soon)
- LowCPU - Low CPU availability
- CertWillSoonExpire - TLS certificate will expire soon
- KeepAliveErrors - Recent keep-alive errors detected
Information States - Normal operational states:
- ClusterStarted - Cluster has started (primary) or node has joined cluster
- NodeStarted - Node has started (may not have joined cluster yet)
- VoteInProgress - Voting process is in progress

Node state flags are exposed via the Prometheus metric ais_target_state_flags and can be monitored using the following methods:

CLI Monitoring

The node state can be viewed directly using the CLI:

$ ais show cluster

This command displays the state for all nodes in the cluster, including any active alerts.

Prometheus Queries

To monitor node states with Prometheus:

# Detect nodes with any red alert condition.
# PromQL has no bitwise operators, so each bit is tested arithmetically:
# floor(flags / bit) % 2 == 1 holds exactly when that bit is set.
ais_target_state_flags > 0 and on (node_id) (
  floor(ais_target_state_flags / 8192) % 2 == 1 or    # OOS
  floor(ais_target_state_flags / 16384) % 2 == 1 or   # OOM
  floor(ais_target_state_flags / 262144) % 2 == 1 or  # OOCPU and NumGoroutines (same value listed for both; verify)
  floor(ais_target_state_flags / 65536) % 2 == 1 or   # DiskFault
  floor(ais_target_state_flags / 131072) % 2 == 1 or  # NoMountpaths
  floor(ais_target_state_flags / 1048576) % 2 == 1    # CertificateExpired
)

# Find nodes with warning conditions.
# Bits are tested as floor(flags / bit) % 2 == 1 because PromQL has no bitwise AND.
ais_target_state_flags > 0 and on (node_id) (
  floor(ais_target_state_flags / 8) % 2 == 1 or      # Rebalancing
  floor(ais_target_state_flags / 16) % 2 == 1 or     # RebalanceInterrupted
  floor(ais_target_state_flags / 32) % 2 == 1 or     # Resilvering
  floor(ais_target_state_flags / 64) % 2 == 1 or     # ResilverInterrupted
  floor(ais_target_state_flags / 128) % 2 == 1 or    # NodeRestarted
  floor(ais_target_state_flags / 32768) % 2 == 1 or  # MaintenanceMode
  floor(ais_target_state_flags / 4096) % 2 == 1 or   # LowCapacity
  floor(ais_target_state_flags / 8192) % 2 == 1      # LowMemory (same value listed for OOS above; verify)
)
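
A single bit of the state-flags value can be tested with only division, floor, and modulo, which is useful because PromQL offers those arithmetic operators but no bitwise AND. The sketch below shows the same test in Python and decodes a value into names. The bit values are a subset of those listed in this section and should be treated as assumptions until confirmed against the AIS source; `has_flag` and `decode_flags` are hypothetical helper names.

```python
def has_flag(flags, mask):
    """Test one bit using only floor-division and modulo (no bitwise AND)."""
    return (int(flags) // mask) % 2 == 1


def decode_flags(flags, names):
    """Return the names whose bit is set; names maps bit value to label."""
    return [name for mask, name in sorted(names.items()) if has_flag(flags, mask)]


# Subset of the warning bits listed in this section (assumed values).
WARNING_BITS = {
    8: "Rebalancing",
    16: "RebalanceInterrupted",
    32: "Resilvering",
    64: "ResilverInterrupted",
    128: "NodeRestarted",
}

# A node that is rebalancing after a restart would report 8 + 128 = 136.
print(decode_flags(8 + 128, WARNING_BITS))
```

The arithmetic test agrees with a bitwise AND for any single-bit mask, so the same masks can be reused in dashboards and in scripts.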

Grafana Alerting

In Grafana, you can set up alerts based on these node state flags:

Create a Grafana alert rule using the PromQL queries above
Set appropriate thresholds and notification channels
Configure different severity levels for red vs. warning conditions

Example Grafana alert rule for red alerts:

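# Alert on critical node conditions (OOS, OOM, OOCPU).
# PromQL has no bitwise AND, so each bit is tested with floor() and modulo.
ais_target_state_flags{node_id=~"$node"} > 0 and (
  floor(ais_target_state_flags{node_id=~"$node"} / 8192) % 2 == 1 or
  floor(ais_target_state_flags{node_id=~"$node"} / 16384) % 2 == 1 or
  floor(ais_target_state_flags{node_id=~"$node"} / 262144) % 2 == 1
)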

This alerting system provides comprehensive visibility into the operational state of your AIStore cluster and helps detect issues before they impact performance or availability.

StatsD Alternative

Important: StatsD support is deprecated and will likely be removed by the end of 2025. New deployments should use the native Prometheus integration described above.

StatsD Exporter for Prometheus

If specific requirements necessitate using StatsD, you can still integrate with Prometheus using its statsd_exporter component that translates StatsD metrics to Prometheus format on-the-fly.

Note: Native Prometheus integration is the preferred option. StatsD exporter should only be considered for deployments with special requirements.

Architecture with StatsD exporter:

[Figure: AIStore monitoring with Prometheus]

In this configuration:

AIS nodes send StatsD metrics to a UDP endpoint
The statsd_exporter receives these metrics and converts them to Prometheus format
Prometheus scrapes the exporter’s HTTP endpoint
Grafana queries Prometheus for visualization

Setup and Configuration

To deploy the StatsD exporter:

Use the prebuilt container image, or

Install from source:

$ go install github.com/prometheus/statsd_exporter@latest

For testing without Prometheus, run with debug logging:

$ statsd_exporter --statsd.listen-udp localhost:8125 --log.level debug

Example debug output:

level=info ts=2021-05-13T15:30:22.251Z caller=main.go:321 msg="Starting StatsD -> Prometheus Exporter" version="(version=, branch=, revision=)"
level=info ts=2021-05-13T15:30:22.251Z caller=main.go:322 msg="Build context" context="(go=go1.16.3, user=, date=)"
level=info ts=2021-05-13T15:30:22.251Z caller=main.go:361 msg="Accepting StatsD Traffic" udp=localhost:8125 tcp=:9125 unixgram=
level=info ts=2021-05-13T15:30:22.251Z caller=main.go:362 msg="Accepting Prometheus Requests" addr=:9102
level=debug ts=2021-05-13T15:30:27.811Z caller=listener.go:73 msg="Incoming line" proto=udp line=aistarget.pakftUgh.kalive.latency:1|ms
level=debug ts=2021-05-13T15:30:29.891Z caller=listener.go:73 msg="Incoming line" proto=udp line=aisproxy.qYyhpllR.pst.count:77|c
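
The "Incoming line" entries in the debug output follow the StatsD line format `bucket:value|type`. The sketch below splits such a line into its parts. The further split of the bucket name into daemon type, node ID, and metric name is an assumption read off the two log lines above, and both function names are hypothetical.

```python
def parse_statsd_line(line):
    """Split a StatsD line 'bucket:value|type' into its parts."""
    bucket, sep, rest = line.partition(":")
    fields = rest.split("|")
    if not sep or not bucket or len(fields) < 2:
        raise ValueError(f"not a StatsD line: {line!r}")
    # Any further fields (for example a sample rate like "@0.1") are ignored.
    return bucket, float(fields[0]), fields[1]


def split_bucket(bucket):
    """Assumed AIS layout: <daemon type>.<node ID>.<metric name>."""
    daemon, node_id, metric = bucket.split(".", 2)
    return daemon, node_id, metric


# The two lines from the debug output above.
for raw in ("aistarget.pakftUgh.kalive.latency:1|ms",
            "aisproxy.qYyhpllR.pst.count:77|c"):
    bucket, value, kind = parse_statsd_line(raw)
    print(split_bucket(bucket), value, kind)
```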

Finally, configure Prometheus to scrape the exporter’s metrics endpoint (default port 9102).

Default port configuration:

StatsD UDP input: 8125
Prometheus HTTP endpoint: 9102

To see all configuration options:

$ statsd_exporter --help

Best Practices

To maximize the value of AIStore’s Prometheus integration:

Retention Planning: Configure appropriate retention periods in Prometheus based on your monitoring needs
Dashboard Organization: Create dedicated Grafana dashboards for:
- Cluster overview (high-level health)
- Per-node performance
- Resource utilization
- Operation latencies
- Error analysis
Alerting: Configure alerts for critical conditions:
- Node state red alerts (OOS, OOM, DiskFault, etc.)
- High error rates
- Disk utilization thresholds
- Performance degradation
Metric Selection: Focus on key operational metrics for routine monitoring
Collection Frequency: Balance scrape intervals for accuracy versus storage requirements
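
The trade-off between scrape interval, retention, and storage can be estimated with the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (commonly about 1 to 2 bytes). The sketch below applies that formula. The series count is a hypothetical figure, not a measured AIS value, and `estimate_disk_bytes` is an illustrative helper name.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough Prometheus disk estimate: retention * samples/s * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster exposing 10,000 series, kept for 15 days.
for interval in (15, 30, 60):
    size_gb = estimate_disk_bytes(10_000, interval, 15) / 1e9
    print(f"scrape every {interval:>2}s: about {size_gb:.2f} GB")
```

Doubling the scrape interval halves the estimate, which is the balance the Collection Frequency item refers to.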

References

| Document | Description |
|---|---|
| Overview | Introduction to AIS observability |
| CLI | Command-line monitoring tools |
| Logs | Log-based observability |
| Metrics Reference | Complete metrics catalog |
| Grafana | Visualizing AIS metrics with Grafana |
| Kubernetes | Working with Kubernetes monitoring stacks |
