Machine‑Learning Operations (ais ml)

Introduced in v3.30, the ml namespace is intended for commands that target ML‑centric data workflows — bulk extraction of training samples, cross‑bucket collation of model artifacts, and manifest‑driven slicing of large corpora.

In this document:

  • ais ml get-batch – one‑shot consolidation of objects (and archived sub‑objects) into a single TAR/TGZ/ZIP/TAR.LZ4.

  • ais ml lhotse-get-batch – higher‑level driver that reads Lhotse cut manifests and spawns one or many get-batch jobs on your behalf.

Jump straight to the section you need:

Table of Contents


Subcommands

$ ais ml <TAB><TAB>
get-batch         # Consolidate objects / archived files into one archive
lhotse-get-batch  # Drive multiple get‑batch runs from Lhotse manifests

--help is available at every level:

$ ais ml get-batch --help
$ ais ml lhotse-get-batch --help

ais ml get-batch

Fetch one set of inputs—objects, archived files, or byte ranges—possibly spanning multiple buckets and providers, and package them into a single output archive. Default format is TAR; pass the destination name with .tgz, .zip, .tar.lz4, etc. to switch.

Typical uses:

  • Build a reproducible training snapshot from disparate buckets.
  • Extract a handful of files from a huge shard without rewriting the shard.
  • Stream an archive directly into a training job (--streaming).

Usage

ais ml get-batch [SRC ...] DST_ARCHIVE [flags]

# SRC can be:
#   - bucket or virtual directory      (ais://speech/)
#   - single object                    (s3://models/chkpt‑001.pth)
#   - archived path inside an object   (ais://data/shard.tar/file.wav)

Examples

Goal One‑liner
Package two objects as training.tar ais ml get-batch ais://dataset --list "audio1.wav,audio2.wav" training.tar
Build compressed range archive ais ml get-batch ais://models --template "checkpoint-{001..100}.pth" model-v2.tgz
Extract a single file from a shard ais ml get-batch ais://data/shard.tar/file.wav dataset.zip
Drive via JSON spec on disk ais ml get-batch training.tar --spec batch.json
Inline YAML spec, stream to disk ais ml get-batch /tmp/out.tar --spec '{ in: [...], streaming_get: true }' --streaming

Key Options

Flag Purpose
--spec, -f PATH JSON or YAML request; overrides CLI flags.
--list OBJ1,OBJ2,... Explicit comma‑separated object list.
--template STR Brace‑expansion template (supports ranges, steps).
--prefix DIR/ Only objects whose names start with the prefix.
--omit-src-bck Strip bucket name inside the resulting archive.
--streaming Stream archive as it is built (lower latency, constant memory).
--cont-on-err Keep going on per‑object errors (best‑effort).
--yes, -y Assume “yes” to all prompts (non‑interactive).
--non-verbose, -nv Minimal progress output.

Full proto definition: api/apc/ml.go.


ais ml lhotse-get-batch

Consumes a Lhotse cuts.jsonl[.gz | .lz4] manifest and spawns one or many get-batch transactions. Ideal for speech/ASR pipelines where a manifest describes thousands of time‑offsets across many recordings.

Key distinction:

Command Output archives Driving data
get-batch exactly one CLI flags or JSON/YAML spec
lhotse-get-batch one or many Lhotse cut manifest

Lhotse Manifest

In a Lhotse manifest, each cut JSON line lists one or more recording sources (URI + byte/time offsets).

Manifests may be:

  • Plain text – cuts.jsonl
  • Gzip‑compressed – cuts.jsonl.gz or cuts.jsonl.gzip
  • LZ4‑compressed – cuts.jsonl.lz4

Each line is an independent cut. See the Lhotse docs for the full schema.

Usage

ais ml lhotse-get-batch --cuts manifest.jsonl[.gz | .lz4] [DST] [flags]

# With --output-template you may omit DST; the template expands per batch.
$ ais ml lhotse-get-batch --help
NAME:
   ais ml lhotse-get-batch - Get multiple objects from Lhotse manifests and package into consolidated archive(s).
     Returns TAR by default; supported formats include: .tar, .tgz or .tar.gz, .zip, .tar.lz4.
     Supports chunking, filtering, and multi-output generation from Lhotse cut manifests.
     Lhotse manifest format: each line contains a single cut JSON object with recording sources;
     Lhotse manifest may be plain (`.jsonl`), gzip‑compressed (`.jsonl.gz` / `.gzip`), or LZ4‑compressed (`.jsonl.lz4`).
     Examples:
     - 'ais ml lhotse-get-batch --cuts manifest.jsonl.gz output.tar'                                        - entire manifest as single TAR;
     - 'ais ml lhotse-get-batch --cuts cuts.jsonl --sample-rate 16000 output.tar'                           - with sample rate conversion;
     - 'ais ml lhotse-get-batch --cuts m.jsonl.lz4 --batch-size 1000 --output-template "a-{001..999}.tar"'  - generate 999 'a-*.tar' batches (1000 cuts each)
     See [Lhotse docs](https://lhotse.readthedocs.io/) for manifest format details.

USAGE:
   ais ml lhotse-get-batch [BUCKET[/NAME_or_TEMPLATE] ...] [DST_ARCHIVE] --spec [JSON_SPECIFICATION|YAML_SPECIFICATION] [command options]

OPTIONS:
   batch-size       number of cuts per output file
   cont-on-err      Keep running archiving xaction (job) in presence of errors in any given multi-object transaction
   cuts             path to Lhotse cuts.jsonl or cuts.jsonl.gz or cuts.jsonl.lz4
   list             Comma-separated list of object or file names, e.g.:
                    --list 'o1,o2,o3'
                    --list "abc/1.tar, abc/1.cls, abc/1.jpeg"
                    or, when listing files and/or directories:
                    --list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
   non-verbose,nv   Non-verbose (quiet) output, minimized reporting, fewer warnings
   omit-src-bck     When set, strip source bucket names from paths inside the archive (ie., use object names only)
   output-template  template for multiple output files (e.g. 'batch-{001..999}.tar')
   prefix           Select virtual directories or objects with names starting with the specified prefix, e.g.:
                    '--prefix a/b/c'   - matches names 'a/b/c/d', 'a/b/cdef', and similar;
                    '--prefix a/b/c/'  - only matches objects from the virtual directory a/b/c/
   sample-rate      audio sample-rate (Hz); used to convert sample offsets (in seconds) to byte offsets
   spec,f           Path to JSON or YAML request specification
   streaming        stream the resulting archive prior to finalizing it in memory
   template         Template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
                    (with optional steps and gaps), e.g.:
                    --template "" # (an empty or '*' template matches everything)
                    --template 'dir/subdir/'
                    --template 'shard-{1000..9999}.tar'
                    --template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
                    and similarly, when specifying files and directories:
                    --template '/home/dir/subdir/'
                    --template "/abc/prefix-{0010..9999..2}-suffix"
   yes,y            Assume 'yes' to all questions
   help, h          Show help

Examples

Goal Command
Single TAR from entire manifest ais ml lhotse-get-batch --cuts manifest.jsonl.gz output.tar
Resample offsets for 16 kHz ais ml lhotse-get-batch --cuts cuts.jsonl --sample-rate 16000 output.tar
Chunk into 1,000‑cut archives ais ml lhotse-get-batch --cuts manifest.jsonl.lz4 --batch-size 1000 --output-template "batch-{001..999}.tar"

Key Options

Flag Purpose
--cuts PATH Required. Path to cuts.jsonl or .jsonl.gz or .jsonl.lz4.
--batch-size N Number of cuts per output archive (default: all).
--output-template STR Brace‑expansion template for multi‑file output.
--sample-rate HZ Convert time offsets (sec) → byte ranges for this rate.
--spec, -f PATH Optional JSON/YAML spec (applies to every batch).

Remaining flags (--list, --template, --prefix, --omit-src-bck, --streaming, --cont-on-err, --yes, --nv) behave exactly like in get-batch.


Tips & Best Practices

  • Streaming (--streaming) vs multipart (buffered) The flag only toggles how AIStore emits the response — nothing changes in the CLI itself.

    Mode AIS behavior on the wire When it helps
    --streaming AIStore delivers the next set of requested samples as soon as possible; the resulting (received) archive grows incrementally; there is no leading apc.MossResp JSON message. • You require early‑data/low‑latency playback
    • AIStore shows low-memory alert (via ais show cluster or Grafana)
    Multipart (flag omitted) AIS first assembles the entire resulting archive in memory. The HTTP reply then has two parts:
    1. apc.MossResp JSON (ordered one‑to‑one with the request)
    2. TAR/TGZ/.tar.lz4/ZIP payload.
    • Plenty of memory on the server side
    • You want the header (sizes, order) upfront
    • Maximum bulk throughput on a fast LAN
  • Leaving bucketprovider blank in a spec entry
    • Pass a default bucket when you call the API — e.g. ais ml get-batch ais://my‑bck --spec …. Any entry with an empty bucket field inherits my‑bck.
    • If provider is omitted the cluster assumes ais://. For S3, GCS, Azure, OCI, etc., always spell out "provider":"s3", "gcp", … per entry.
    • Mixed‑provider requests are fine; just be explicit for every non‑AIS object.
  • Manifests at scale. lhotse-get-batch --batch-size divides the manifest on the client side—no need to pre‑split the file.
  • Check specs into Git. Treat JSON/YAML batch specs as immutable artifacts for reproducibility.

See Also