AIStore Observability

This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and the CLI, enabling users to monitor, debug, and optimize their deployments.

Observability Architecture

AIS provides multiple layers of observability:

┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐    ┌───────────┐ │
│  │  Grafana  │    │   Custom  │ │
│  │ Dashboard │    │   UIs     │ │
│  └───────────┘    └───────────┘ │
├─────────────────────────────────┤
│       Collection Layer          │
│  ┌───────────┐    ┌───────────┐ │
│  │ Prometheus│    │  StatsD*  │ │
│  │           │    │           │ │
│  └───────────┘    └───────────┘ │
├─────────────────────────────────┤
│       Instrumentation Layer     │
│  ┌───────────┐    ┌───────────┐ │
│  │  Metrics  │    │   Logs    │ │
│  │ Endpoints │    │           │ │
│  └───────────┘    └───────────┘ │
├─────────────────────────────────┤
│          Access Layer           │
│  ┌───────────┐    ┌───────────┐ │
│  │    CLI    │    │   REST    │ │
│  │ Interface │    │   APIs    │ │
│  └───────────┘    └───────────┘ │
└─────────────────────────────────┘

(*) StatsD support will likely be removed in late 2025.

Transition from StatsD to Prometheus

AIS began with StatsD for metrics collection but has evolved to primarily use Prometheus. Key points about this transition:

  • Prometheus (and Grafana) is now the recommended monitoring system
  • All new metric implementations use Prometheus exclusively
  • The transition provides better scalability, more detailed metrics, variable labels for advanced filtering, and improved integration with modern observability stacks
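
For example, once Prometheus support is enabled, every AIS node serves its metrics over plain HTTP, ready to be scraped. A minimal sketch, assuming a local playground target listening on port 8081 (host, port, and the exact metric names are deployment-specific):

$ curl -s http://localhost:8081/metrics | head

Metric names are typically prefixed with ais_; the complete list, with descriptions, is in the Metrics Reference (see the table below).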

Observability Methods

| Method | Description | Use Cases | Documentation |
|:-------|:------------|:----------|:--------------|
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick lookup of a specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Monitoring AIS in Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |
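
To illustrate the first of these methods, two basic CLI commands summarize node status and cluster-wide performance, respectively (a quick sketch; see Observability: CLI for the complete command set):

$ ais show cluster
$ ais show performance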

Kubernetes Integration

For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.

A dedicated (and separate) GitHub repository provides, among other things, Helm charts for AIS cluster monitoring.

See the Kubernetes Observability document for details.
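
As a quick illustration (the namespace and service name below are placeholders, not values fixed by the charts): once the monitoring stack is installed, the bundled Grafana UI can be reached by port-forwarding its Kubernetes service, e.g.:

$ kubectl -n monitoring get svc
$ kubectl -n monitoring port-forward svc/grafana 3000:80

and then opening http://localhost:3000. Consult the repository referenced above for the actual chart values and service names.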

Key Metrics Categories

AIS exposes metrics across several categories:

  • Cluster Health: Node status, membership changes
  • Resource Usage: CPU, memory, disk utilization
  • Performance: Throughput, latency, error counts
  • Storage Operations: GET/PUT rates, object counts, error counts
  • Errors: Network errors (“broken pipe”, “connection reset”), timeouts (“deadline exceeded”), retries (“too-many-requests”), disk faults, OOM, out-of-space, and more

In addition, all supported jobs that read or write data report their progress in terms of object and byte counts.

Briefly, two CLI examples:

Cluster performance: operation counts and latency

$ ais performance latency --refresh 10 --regex get

| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:--------------------:|:---------------------:|:---------------------:|
| T1     | 800        | 180ms      | 3200   | 25ms   | 12GB / 3.75MB       | 50                    | 240ms                |
| T2     | 1000       | 150ms      | 4000   | 28ms   | 15GB / 3.75MB       | 70                    | 230ms                |
| T3     | 700        | 200ms      | 2800   | 32ms   | 10GB / 3.57MB       | 40                    | 215ms                |

- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
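
The same statistics can be sliced differently, e.g., by throughput or by raw counters. A sketch (run ais performance --help for the authoritative list of subcommands and flags):

$ ais performance throughput --refresh 10
$ ais performance counters --regex get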

Batch job: Prefetch

$ ais show job prefetch --refresh 10

prefetch-objects[MV4ex8u6h] (run options: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE             ID              KIND                    BUCKET          OBJECTS         BYTES           START           END     STATE
KactABCD         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 27              27.00MiB        18:28:55        -       Running
XXytEFGH         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 23              23.00MiB        18:28:55        -       Running
YMjtIJKL         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 41              41.00MiB        18:28:55        -       Running
oJXtMNOP         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 34              34.00MiB        18:28:55        -       Running
vWrtQRST         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 23              23.00MiB        18:28:55        -       Running
ybTtUVWX         MV4ex8u6h       prefetch-listrange      s3://cloud-bucket 31              31.00MiB        18:28:55        -       Running
                                Total:                                  179             179.00MiB ✓
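
To block until this job completes, or to re-check it afterwards, something along these lines may be used (a sketch; ais job --help lists the exact syntax):

$ ais wait MV4ex8u6h
$ ais show job MV4ex8u6h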

Best Practices

  • Configure appropriate log levels based on your deployment stage (development or production).
  • Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
  • Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
  • View or download logs via Loki, or use the CLI commands ais log and ais cluster download-logs (use --help for details) to access logs for troubleshooting and analysis, as illustrated below.
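
For example, a brief sketch (NODE_ID is a placeholder; check --help on each command for the exact syntax in your release):

$ ais log show NODE_ID
$ ais cluster download-logs --help
$ ais config cluster log.level

The first command displays a given node's log, the second shows how to collect compressed logs from all nodes at once, and the third prints the current cluster-wide log verbosity.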

Further Reading