Common practice today is to execute data pre-processing, augmentation, and a wide variety of custom ETL workflows on individual client machines. This approach lacks scalability and often results in significant performance degradation due to unnecessary data movement. Unlike most open-source and cloud ETL solutions, AIStore performs transformations on the same machines that store your data, minimizing redundant transfers by exploiting data locality.

This is precisely why we developed an ETL subsystem for AIStore. As a distributed storage system, AIStore has precise knowledge of where every object resides within your dataset. This awareness eliminates redundant data movement, streamlining the entire pipeline and delivering, as we demonstrate below, a 35x speedup compared to conventional client-side ETL workflows.

This post highlights recent performance improvements across all three ETL stages in AIStore:

  • Extract: Introduced a WebSocket-based streaming communication mechanism between AIStore and the ETL web server.
  • Transform: Provided a high-performance SDK framework for building ETL web servers with custom transformation logic.
  • Load: Enabled direct put, allowing ETL output to bypass intermediaries and write directly to the destination target node.

WebSocket Streaming: Persistent and Reusable Communication Channel

In distributed systems, the choice of communication protocol can significantly impact performance. In AIStore ETL (AIS ETL), the existing mechanisms (Hpush, Hpull, and I/O) rely on RESTful HTTP interfaces to transfer user data in HTTP requests and responses. While this approach provides clean isolation between the TCP sessions used for individual objects, it also introduces overhead: each object transformation requires establishing a new HTTP session, which can be costly, especially for small objects.

To reduce this overhead, AIStore ETL now supports a new communication type: WebSocket.

WebSocket enables long-lived TCP connections that facilitate continuous data streaming between targets and ETL containers without repeated connection setup and teardown. HTTP-based traffic, in contrast, establishes and terminates a connection per request, an overhead whose relative cost grows as transfer sizes shrink.

Specifically, for each offline (bucket-to-bucket) transform request, AIStore:

  1. Pre-establishes multiple long-lived WebSocket connections to the ETL server.
  2. Distributes source objects across these connections.
  3. Streams transformed object bytes back through the same connection.

Figure: ETL WebSocket streaming between AIStore targets and the ETL web server.

Note: With direct put optimization enabled, the transformed object is delivered to the destination target via an HTTP PUT request instead of being streamed back over the WebSocket connection.

This upgrade significantly reduces overhead and improves throughput, especially in high-volume transformations of small objects.
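
To make this concrete, here's a minimal sketch of driving an offline transform over WebSocket with the AIStore Python SDK. The endpoint, bucket names, and pod template are placeholders, and the "ws" communication-type string is our assumption; check the SDK reference for the exact constants and signatures in your version.

```python
from aistore.sdk import Client

client = Client("http://<ais-endpoint>")  # placeholder: your cluster endpoint

# Transformer pod spec (see the ais-etl repo for ready-made templates).
POD_TEMPLATE = open("ffmpeg_pod.yaml").read()

# Register the ETL and ask targets to talk to it over WebSocket.
etl = client.etl("audio-ws")
etl.init_spec(template=POD_TEMPLATE, communication_type="ws")

# Offline (bucket-to-bucket) transform: AIStore streams each source object
# through the pre-established WebSocket connections and writes the results
# to the destination bucket.
src = client.bucket("librispeech")
dst = client.bucket("librispeech-wav")
job_id = src.transform(etl_name="audio-ws", to_bck=dst, ext={"flac": "wav"})
client.job(job_id).wait()  # block until the transform job completes
```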

High-Performance ETL Web Server Framework

AIStore ETL is language- and framework-agnostic. You can deploy your own custom web server as a transformation pod, supporting both inline transformations (real-time GET requests) and offline batch transformations (bucket-to-bucket).

However, building such a server from scratch involves more than just writing transformation logic. It must also be capable of:

  • Performing health checks
  • Communicating with AIStore targets
  • Parsing etl args—user-defined parameters that control the transformation behavior
  • Supporting direct put, which allows transformed objects to be directly written to the target bucket without going through the client
  • Managing HTTP and WebSocket protocols with proper concurrency control

Selecting the right web server and communication strategy depends on factors like object size and volume, desired concurrency model, and whether you need a synchronous (WSGI) or asynchronous (ASGI) stack. Each option has its own trade-offs.

To simplify this, we've introduced the AIS-ETL Web Server Framework in both Go and Python. These SDKs abstract away the boilerplate so you can build and deploy custom ETL containers in minutes. Focus solely on your transformation logic; the SDK handles everything else, including networking, protocol handling, and high-throughput optimizations.
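
As a taste of the Python framework, the sketch below implements a complete (toy) transformer. We're assuming the FastAPIServer base class and the transform(data, path, etl_args) hook used by the examples in the ais-etl repo; names may differ slightly between versions.

```python
from aistore.sdk.etl.webserver.fastapi_server import FastAPIServer

class UpperCaseServer(FastAPIServer):
    """Toy transformer: upper-case every object's bytes."""

    def transform(self, data: bytes, path: str, etl_args: str) -> bytes:
        # `path` identifies the object being transformed; `etl_args` carries
        # optional user-defined parameters. Health checks, HTTP/WebSocket
        # handling, and direct put are all managed by the base class.
        return data.upper()

# The framework exposes an ASGI app; serve it with any ASGI server, e.g.:
#   uvicorn main:app --host 0.0.0.0 --port 8000
app = UpperCaseServer().app
```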

Getting Started

Ready to build your own ETL server? Check out the quickstart guides and example implementations in the AIS ETL repository.

Direct Put Optimization: Faster Bucket-to-Bucket ETL Transformation

In offline transformations, the destination target for a transformed object usually differs from the original target. By default, the ETL container sends the transformed data back to the original source target, which then forwards it to the destination. The direct put optimization streamlines this flow by allowing the ETL container to send the transformed object directly to the destination target.

Figure: Direct put, where the ETL container writes the transformed object straight to the destination target.
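
The exchange can be pictured with the following sketch. This illustrates the flow only; the helper and the way the destination address is conveyed are hypothetical, and the web server framework implements this hand-off for you.

```python
from typing import Optional

import requests

def handle_object(
    data: bytes,
    direct_put_url: Optional[str],
    transform=lambda b: b.upper(),  # toy stand-in for user transformation logic
) -> Optional[bytes]:
    """Hypothetical sketch: how an ETL server disposes of one transformed object."""
    out = transform(data)
    if direct_put_url:
        # Direct put: AIStore told us where the destination target lives,
        # so write the result straight there and skip the extra hop
        # through the source target.
        requests.put(direct_put_url, data=out).raise_for_status()
        return None  # nothing to send back
    # Default path: return the bytes to the source target, which forwards them.
    return out
```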

Benchmark: FFmpeg Audio Transformation

Now that we've introduced the ETL web server framework, direct put optimization, and WebSocket streaming, the next logical step is to see how they actually perform in a real-world scenario.

To evaluate AIStore's ETL throughput, scalability, and the impact of features like direct_put and fqn, we designed a benchmark using FFmpeg, a popular open-source tool for audio and video processing. This benchmark simulates a practical workload and helps highlight how different implementation choices affect performance.

The transformation pipeline converts .flac files to .wav, with tunable parameters for Audio Channels (AC) and Audio Rate (AR).
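
At its core, the transformation is a single FFmpeg invocation over stdin/stdout, roughly as sketched below. The helper name is ours; the -ac/-ar flags are standard FFmpeg options for channel count and sample rate.

```python
import subprocess

def transform_audio(data: bytes, ac: int = 2, ar: int = 44100) -> bytes:
    """Decode FLAC bytes and re-encode them as WAV with the given AC/AR."""
    proc = subprocess.run(
        [
            "ffmpeg",
            "-i", "pipe:0",         # read the input object from stdin
            "-ac", str(ac),         # audio channels
            "-ar", str(ar),         # audio sample rate (Hz)
            "-f", "wav", "pipe:1",  # write WAV to stdout
        ],
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout
```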

Why FFmpeg?

  1. Practicality – FFmpeg is widely adopted, including within NVIDIA’s own speech teams.
  2. Relevance – Audio transformation is a byte-level, compute-intensive task, ideal for end-to-end ETL benchmarking.

To expose FFmpeg as a transformation service, we used three different Python-based web servers: FastAPI, Flask, and a multi-threaded HTTP server. Each implementation was tested under multiple configurations, toggling direct_put (whether the transformer writes results directly to the destination target) and fqn (whether targets pass each object's fully qualified file path, rather than its bytes, so the container reads it straight from local disk).

1. Setup and Configuration

To ensure the benchmark reflects production-like conditions, we ran all tests on a high-performance Kubernetes cluster with ample CPU, memory, and storage capacity.

1.1. System Specs

The benchmark was run on a high-performance AIStore cluster using the following system configuration:

  • Kubernetes Cluster: 3 bare-metal nodes, each hosting one proxy and one target
  • Storage: 10 × 9.1 TiB HDDs per target
  • CPU: 48 cores per node
  • Memory: 187 GiB per node

1.2. Transformer and Scripts

The transformer and benchmark were implemented using the FFmpeg transformer and benchmark scripts from the AIS ETL repository.

1.3. Dataset

We used the LibriSpeech (500 Hours) dataset, which contains approximately 149,000 .flac audio files. Each file was converted to stereo at 44,100 Hz and saved as .wav.

2. Baseline: Client-Side Transformation

Before diving into AIStore-based benchmarks, it’s helpful to understand the baseline: how long would this transformation take using traditional methods?

In typical object storage workflows, data preprocessing happens on client machines: data is pulled, processed locally, then uploaded back. We replicated this using a client machine with specs identical to the cluster nodes and ran the transformation with 24 threads. End to end, this client-side run took over two hours to complete.
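
The baseline is the familiar pull-transform-push loop, sketched below with a thread pool. Object access uses the aistore Python SDK (reader/writer method names vary across SDK versions), and transform_audio is the FFmpeg helper sketched earlier.

```python
from concurrent.futures import ThreadPoolExecutor

from aistore.sdk import Client

client = Client("http://<ais-endpoint>")  # placeholder: your cluster endpoint
src = client.bucket("librispeech")
dst = client.bucket("librispeech-wav")

def process(name: str) -> None:
    data = src.object(name).get().read_all()  # pull the object to the client
    wav = transform_audio(data)               # transform locally with FFmpeg
    dst.object(name.replace(".flac", ".wav")).put_content(wav)  # push it back

names = [entry.name for entry in src.list_all_objects()]
with ThreadPoolExecutor(max_workers=24) as pool:  # 24 threads, as in the benchmark
    list(pool.map(process, names))
```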

3. AIStore ETL Benchmarks

With the baseline in place, we turned to benchmarking server-side transformation using AIStore ETL. We tested all three web servers—FastAPI, Flask, and HTTP—across multiple combinations of:

  • Communication protocols: HPUSH, HPULL, WebSocket
  • fqn enabled/disabled: whether targets pass each object's fully qualified file path instead of its bytes
  • direct_put enabled/disabled: whether the transformer writes transformed objects directly to the destination target

Each configuration was measured for total transformation time.
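
A sweep over this matrix is a short script. The sketch below reuses the client, buckets, and pod template from the WebSocket example and times each run end to end. We assume init_spec's arg_type="fqn" toggles FQN mode; the direct-put toggle is omitted because its exact knob depends on the transformer deployment and SDK version.

```python
import itertools
import time

results = {}
for comm, fqn in itertools.product(["hpush", "hpull", "ws"], [True, False]):
    name = f"ffmpeg-{comm}-{'fqn' if fqn else 'bytes'}"
    etl = client.etl(name)
    etl.init_spec(
        template=POD_TEMPLATE,
        communication_type=comm,
        arg_type="fqn" if fqn else "",  # pass file paths vs. object bytes
    )
    start = time.monotonic()
    job_id = src.transform(etl_name=name, to_bck=dst, ext={"flac": "wav"})
    client.job(job_id).wait()
    results[name] = time.monotonic() - start  # total transformation time
    etl.stop()  # tear down the transformer before the next configuration
```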

3.1. Results Summary

The following chart summarizes performance across the different configurations:

Figure: Benchmark results across web servers, communication types, and feature flags.

Web server   Comm. type   FQN   Direct put   Time (s)
fastapi      ws           fqn   True          218.541
flask        hpush        fqn   False         258.171
flask        hpull        fqn   False         258.868
fastapi      hpush        -     True          259.963
fastapi      hpush        fqn   True          260.094
flask        hpull        fqn   True          261.366
flask        hpush        fqn   True          261.658
fastapi      hpush        -     False         262.539
fastapi      hpull        fqn   True          262.666
fastapi      hpull        fqn   False         263.526
fastapi      hpush        fqn   False         266.175
flask        hpush        -     False         270.632
fastapi      hpull        -     True          270.923
flask        hpush        -     True          272.426
flask        hpull        -     False         275.475
fastapi      hpull        -     False         277.491
flask        hpull        -     True          277.512
fastapi      ws           fqn   True          298.657
http         hpush        -     False         419.697
http         hpull        fqn   False         441.644
http         hpush        fqn   False         444.687
http         hpush        -     True          636.434
http         hpush        fqn   True          652.890
http         hpull        -     False         906.590
http         hpull        -     True         1070.840

(In the FQN column, "fqn" means enabled and "-" means disabled.)

4. Performance Insights

The best-performing configuration—FastAPI using WebSocket, with fqn and direct_put enabled—completed the entire job in just 3 minutes 39 seconds.

This marks a ~35× speedup compared to the client-side baseline, where the same workload took over 2 hours to finish.

⚠️ Caveat: The baseline was run on a single client machine, while the AIStore benchmark used a 3-node cluster. This is not a 1:1 hardware comparison—but that’s exactly the point. AIStore ETL is built to scale horizontally. Client-side processing, even on powerful machines, hits I/O and CPU limits quickly. In contrast, AIStore parallelizes the workload across all available nodes.

Beyond raw speed, AIStore ETL eliminates unnecessary data movement, reduces client-side compute costs, and simplifies pipeline management. As the cluster size grows, performance scales linearly with added nodes/disks, offering even greater throughput potential.