AIStore natively supports four archive/serialization formats across all APIs, batch jobs, and functional extensions: TAR, TGZ (TAR.GZ), TAR.LZ4, and ZIP.

Motivation

Archives address the small-file problem - performance degradation from random access to very large datasets containing many small files.

To qualify “very large” and “small-file”: the datasets we usually see in the field contain 10+ million files, with individual file sizes ranging from 1 KB to 100 KB.

AIStore’s implementation allows unmodified clients and applications to work efficiently with archived datasets.

Key benefits:

  • Improved I/O performance via reduced metadata lookups and network roundtrips
  • Seamless integration with existing, unmodified workflows
  • Implicit dataset backup: each archive acts as a self-contained, immutable copy of the original files

Because each shard is a self-contained, immutable representation of its original files, a sharded dataset is also easy to replicate, snapshot, or version without additional tooling.

Supported Formats

  • TAR (.tar) - Unix archive format (since 1979) supporting USTAR, PAX, and GNU TAR variants
  • TGZ (.tgz, .tar.gz) - TAR with gzip compression
  • TAR.LZ4 (.tar.lz4) - TAR with lz4 compression
  • ZIP (.zip) - PKWARE ZIP format (since 1989)
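
As a rough local illustration of what these formats look like on the wire, the following sketch serializes the same set of small files as TAR, TGZ, and ZIP using only the Python standard library. It does not use AIStore or its API; the sample names and the helpers `make_tar`, `make_zip`, and `list_names` are hypothetical. TAR.LZ4 is left out because lz4 support needs a third-party package.

```python
import io
import tarfile
import zipfile

# Hypothetical sample data: a few small "files" held in memory.
SAMPLES = {
    "images/0001.txt": b"first sample",
    "images/0002.txt": b"second sample",
    "labels/0001.cls": b"7",
}


def make_tar(samples, compression=""):
    """Serialize samples as plain TAR ("") or TGZ ("gz") and return the bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=f"w:{compression}") as tar:
        for name, data in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_zip(samples):
    """Serialize samples as a ZIP archive and return the bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in samples.items():
            zf.writestr(name, data)
    return buf.getvalue()


def list_names(blob, fmt):
    """Return the member pathnames stored in an archive blob."""
    if fmt == "zip":
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return zf.namelist()
    # "r:*" lets tarfile detect the compression (none or gzip) by itself.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return tar.getnames()


archives = {
    "tar": make_tar(SAMPLES),
    "tgz": make_tar(SAMPLES, "gz"),
    "zip": make_zip(SAMPLES),
}
```

Each entry in `archives` is a single object holding all three files, so storing it takes one write instead of three.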

Operations

AIStore can natively read, write, append¹, and list archives. Operations include:

  • Regular GET and PUT requests
  • get-batch - efficient multi-object/multi-file retrieval
  • list-objects - “opens” archives and includes contained pathnames in results
  • dsort - distributed archive creation and transformation
  • aisloader - benchmarking with archive workloads
  • Concurrent multi-object transactions for bulk archive generation from selected objects
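
To make the idea of "opening" an archive more concrete, here is a minimal local sketch using Python's standard `tarfile` module. It mimics, in spirit only, a listing that includes archived pathnames and a multi-file read served from one shard. The helpers `build_shard`, `list_archived`, and `get_archived`, and the shard name `shard-000.tar`, are hypothetical and are not part of the AIStore API.

```python
import io
import tarfile


def build_shard(samples):
    """Pack a dict of {pathname: bytes} into an in-memory TAR shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_archived(shard_name, blob):
    """List archived files as 'shard/pathname' entries."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        return [f"{shard_name}/{m.name}" for m in tar if m.isfile()]


def get_archived(blob, paths):
    """Read several archived files in a single pass over the shard."""
    wanted = set(paths)
    found = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        for member in tar:
            if member.isfile() and member.name in wanted:
                found[member.name] = tar.extractfile(member).read()
    return found


shard = build_shard({
    "a/0001.txt": b"alpha",
    "a/0002.txt": b"beta",
    "b/0001.txt": b"gamma",
})
listing = list_archived("shard-000.tar", shard)
batch = get_archived(shard, ["a/0001.txt", "b/0001.txt"])
```

The point of the sketch is the access pattern: one shard is fetched and scanned once, and any number of contained files are returned from that single pass.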

Default format: TAR is the system default when serialization format is unspecified.
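
One way to picture the default is a small resolver that maps an object name to a format and falls back to TAR. This is only a sketch: the extension table comes from the list of supported formats above, while the function name `resolve_format`, the `explicit` parameter, and the choice to treat an unrecognized extension as "unspecified" are assumptions made for illustration, not AIStore's actual logic.

```python
# Extensions from the "Supported Formats" list; longer suffixes are checked first
# so that ".tar.gz" and ".tar.lz4" are not mistaken for something else.
EXTENSIONS = (
    (".tar.lz4", "tar.lz4"),
    (".tar.gz", "tgz"),
    (".tgz", "tgz"),
    (".tar", "tar"),
    (".zip", "zip"),
)
SUPPORTED = {"tar", "tgz", "tar.lz4", "zip"}
DEFAULT_FORMAT = "tar"


def resolve_format(name, explicit=None):
    """Pick a serialization format (hypothetical helper).

    An explicitly requested format wins; otherwise the name's extension is
    used; otherwise the format is considered unspecified and TAR is returned.
    """
    if explicit is not None:
        if explicit not in SUPPORTED:
            raise ValueError(f"unsupported format: {explicit}")
        return explicit
    for suffix, fmt in EXTENSIONS:
        if name.endswith(suffix):
            return fmt
    return DEFAULT_FORMAT
```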


¹ APPEND is supported for the TAR format only. The other formats (ZIP, TGZ, TAR.LZ4) were not designed for true append operations; appending to them can only be emulated by extracting all contents and recreating the archive, which significantly impacts performance.
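
The difference between a true append and the extract-all-recreate emulation can be sketched locally with Python's standard `tarfile` module. This is an illustration of the general technique, not AIStore code; the helper names and file names are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def _add_bytes(tar, name, data):
    """Add an in-memory payload to an open TarFile."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def append_to_tar(path, name, data):
    """True append: the new member is written after the existing ones."""
    with tarfile.open(path, mode="a") as tar:
        _add_bytes(tar, name, data)


def append_to_tgz_emulated(path, name, data):
    """Emulated append: read every member, then rewrite the whole archive."""
    with tarfile.open(path, mode="r:gz") as src:
        members = [
            (m, src.extractfile(m).read()) for m in src.getmembers() if m.isfile()
        ]
    with tarfile.open(path, mode="w:gz") as dst:
        for member, payload in members:
            dst.addfile(member, io.BytesIO(payload))
        _add_bytes(dst, name, data)


def names(path):
    with tarfile.open(path, mode="r:*") as tar:
        return tar.getnames()


def demo():
    """Create one TAR and one TGZ shard, append to both, return their listings."""
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = os.path.join(tmp, "shard.tar")
        tgz_path = os.path.join(tmp, "shard.tgz")
        with tarfile.open(tar_path, mode="w") as tar:
            _add_bytes(tar, "a.txt", b"alpha")
        with tarfile.open(tgz_path, mode="w:gz") as tgz:
            _add_bytes(tgz, "a.txt", b"alpha")
        append_to_tar(tar_path, "b.txt", b"beta")
        append_to_tgz_emulated(tgz_path, "b.txt", b"beta")
        return names(tar_path), names(tgz_path)
```

Both shards end up with the same members, but the TGZ path reads and recompresses everything that was already stored, so its cost grows with the size of the archive rather than with the size of the appended file. Python's `tarfile` itself reflects this limitation: it accepts mode `"a"` for uncompressed TAR and rejects `"a:gz"`.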

See also