ARCHIVE
AIStore natively supports four archive/serialization formats across all APIs, batch jobs, and functional extensions: TAR, TGZ (TAR.GZ), TAR.LZ4, and ZIP.
Motivation
Archives address the small-file problem - performance degradation from random access to very large datasets containing many small files.
To qualify “very large” and “small-file”: the datasets we usually see in the field contain 10+ million files, with sizes ranging from 1K to 100K.
AIStore’s implementation allows unmodified clients and applications to work efficiently with archived datasets.
Key benefits:
- Improved I/O performance via reduced metadata lookups and network roundtrips
- Seamless integration with existing, unmodified workflows
- Implicit dataset backup: each archive acts as a self-contained, immutable copy of the original files
Because each shard is a self-contained, immutable representation of its original files, sharded datasets are also easy to replicate, snapshot, or version without additional tooling.
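As a rough local illustration of the idea (not AIStore code), the sketch below packs many small files into a handful of TAR shards using only the Python standard library. The helper name `pack_shards`, the file names, and the shard size are hypothetical and chosen for the example.

```python
import io
import tarfile


def pack_shards(files, files_per_shard):
    """Group (name, payload) pairs into in-memory TAR shards.

    Returns a list of bytes objects, one per shard.
    """
    shards = []
    for start in range(0, len(files), files_per_shard):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, payload in files[start:start + files_per_shard]:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shards.append(buf.getvalue())
    return shards


# Ten 1K files become three objects to store and look up instead of ten.
small_files = [(f"sample-{i:04d}.txt", b"x" * 1024) for i in range(10)]
shards = pack_shards(small_files, files_per_shard=4)
```

Each resulting shard can then be stored as a single object, which is where the reduction in metadata lookups and roundtrips comes from.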
Supported Formats
- TAR (.tar) - Unix archive format (since 1979) supporting USTAR, PAX, and GNU TAR variants
- TGZ (.tgz, .tar.gz) - TAR with gzip compression
- TAR.LZ4 (.tar.lz4) - TAR with lz4 compression
- ZIP (.zip) - PKWARE ZIP format (since 1989)
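For client-side code that needs to tell these formats apart, a minimal sketch could map the extensions listed above to format names. The table and the `detect_format` helper are hypothetical illustrations. They are not part of any AIStore API and do not claim to mirror how AIStore itself resolves formats.

```python
# Extension-to-format table built from the list above.
ARCHIVE_EXTENSIONS = {
    ".tar": "TAR",
    ".tgz": "TGZ",
    ".tar.gz": "TGZ",
    ".tar.lz4": "TAR.LZ4",
    ".zip": "ZIP",
}


def detect_format(object_name):
    """Return the archive format implied by the name, or None if there is no match."""
    for ext, fmt in ARCHIVE_EXTENSIONS.items():
        if object_name.endswith(ext):
            return fmt
    return None


print(detect_format("shard-0001.tar.lz4"))
print(detect_format("notes.txt"))
```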
Operations
AIStore can natively read, write, append¹, and list archives. Operations include:
- Regular GET and PUT requests:
  - Go API - see “ArchPath” parameter
  - Python SDK - ditto
  - Python SDK/Archive - see archive-related config
- get-batch - efficient multi-object/multi-file retrieval
- list-objects - “opens” archives and includes contained pathnames in results
- dsort - distributed archive creation and transformation
- aisloader - benchmarking with archive workloads
- Concurrent multi-object transactions for bulk archive generation from selected objects
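To make two of the operations above more concrete, here is a local, standard-library sketch of what reading a single archived file by its in-archive path and "opening" an archive during listing amount to conceptually. It does not call AIStore. The helper names are hypothetical, and the `object/member` naming used for listed entries is an assumption made for illustration.

```python
import io
import tarfile


def make_tar(members):
    """Build an in-memory TAR from a {name: payload} dict."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_archived_file(archive_bytes, archpath):
    """Return the content of one regular file inside the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return tar.extractfile(archpath).read()


def list_with_archived(object_name, archive_bytes):
    """List the archive itself followed by the pathnames it contains."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [object_name] + [f"{object_name}/{n}" for n in tar.getnames()]


archive = make_tar({"a/1.txt": b"one", "a/2.txt": b"two"})
print(read_archived_file(archive, "a/2.txt"))
print(list_with_archived("shard.tar", archive))
```

With AIStore the equivalent work happens on the storage side, so the client receives only the requested file rather than the whole archive.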
Default format: TAR is the system default when serialization format is unspecified.
¹ APPEND is supported for TAR format only. Other formats (ZIP, TGZ, TAR.LZ4) were not designed for true append operations; for them, append can only be emulated by extracting all content and recreating the archive, which significantly impacts performance.
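The difference described in the footnote can be sketched locally with Python's `tarfile` module, which supports append mode for uncompressed TAR only. The helpers below are hypothetical and are not AIStore code: `append_to_tar` is a true append, while `append_to_tgz` shows the extract-all-recreate emulation for a gzip-compressed archive (regular files only, for brevity).

```python
import io
import os
import tarfile
import tempfile


def _add_bytes(tar, name, payload):
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def append_to_tar(path, name, payload):
    """True append: existing members stay in place, new data goes at the end."""
    with tarfile.open(path, mode="a") as tar:
        _add_bytes(tar, name, payload)


def append_to_tgz(path, name, payload):
    """Emulated append: read every member, then rewrite the whole archive."""
    with tarfile.open(path, mode="r:gz") as src:
        existing = [(m, src.extractfile(m).read())
                    for m in src.getmembers() if m.isfile()]
    with tarfile.open(path, mode="w:gz") as dst:
        for member, data in existing:
            dst.addfile(member, io.BytesIO(data))
        _add_bytes(dst, name, payload)


with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "shard.tar")
    tgz_path = os.path.join(tmp, "shard.tgz")
    with tarfile.open(tar_path, mode="w") as t:
        _add_bytes(t, "a.txt", b"first")
    with tarfile.open(tgz_path, mode="w:gz") as t:
        _add_bytes(t, "a.txt", b"first")

    append_to_tar(tar_path, "b.txt", b"second")
    append_to_tgz(tgz_path, "b.txt", b"second")

    with tarfile.open(tar_path) as t:
        tar_names = t.getnames()
    with tarfile.open(tgz_path) as t:
        tgz_names = t.getnames()
```

Both archives end up with the same members, but the emulated path rereads and rewrites everything already stored, so its cost grows with the size of the archive rather than with the size of the appended file.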
