Copying existing file datasets in two easy steps
AIStore supports numerous ways to copy, download, or otherwise transfer existing datasets. Much depends on where the dataset in question resides, and whether we can access that location via some sort of HTTP-based interface. I’ll put more references below. But in this post, let’s talk about datasets that already reside on premises.
The term on premises here covers a very wide spectrum of use cases, ranging from commercial high-end (and possibly distributed) filers to your own Linux or Mac machine (where you may, or may not, want to run AIStore itself).
Ultimately, the only precondition is that there is a directory you can access that contains the files to migrate or copy. It turns out that everything else can be done in two easy steps:

1. Run a vanilla HTTP server to serve the directory in question.
2. Use AIS CLI to prefetch (or download) the files from the resulting HTTP location.
Implementation-wise, both Step 1 and Step 2 have multiple variations, and we’ll consider at least some of them below. But first, let’s take a look at an example:
# Let's assume the files we want to copy are located under /tmp/abc:
$ cd /tmp
$ ls abc
hello-world
shard-001.tar
...
shard-999.tar
# Step 1. run local http server
# =============================
$ python3 -m http.server --bind 0.0.0.0 51061
# use AIS CLI to make sure the files are readable
$ ais get http://localhost:51061/abc/hello-world
# Step 2. get all files in the range 'shard-{001..999}.tar'
# =========================================================
# keep using AIS CLI to list HTTP buckets
# (and note that AIS will create one on the fly after the very first successful `GET`)
$ ais ls ht://
ht://ZDE1YzE0NzhiNWFkMQ
# run a batch `prefetch` job to load, from this bucket, the objects whose names
# match the Bash brace-expansion template
$ ais start prefetch ht://ZDE1YzE0NzhiNWFkMQ --template 'shard-{001..999}.tar'
Here we run Python’s own `http.server` to listen on port 51061 and serve the files from the directory that we have previously `cd`-ed into (`/tmp`, in the example).
Of course, the port, the directory, and the filenames above are all arbitrary and chosen purely for illustration. The main point the example is trying to make is that HTTP connectivity of any kind immediately opens up a number of easy ways to migrate or replicate any data that exists as files.
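Incidentally, the same quick sanity check can be done without the AIS CLI. Below is a minimal Go sketch that simply GETs one of the served files and verifies the HTTP status; the URL is an assumption based on the `http.server` instance from the example above:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Sanity-check Step 1: fetch one served file and expect 200 OK.
	// The URL assumes the `python3 -m http.server` from the example (port 51061).
	resp, err := http.Get("http://localhost:51061/abc/hello-world")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected HTTP status: %s", resp.Status)
	}
	fmt.Println("OK: the files are readable over HTTP")
}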
As for the aforementioned implementation variations, they include, for instance, running a Go-based HTTP server instead of the Python one (`htserver.go`):
package main
import (
	"log"
	"net/http"
)
func main() {
	// serve /tmp on port 52062; exit if the listener fails to start
	log.Fatal(http.ListenAndServe(":52062", http.FileServer(http.Dir("/tmp"))))
}
and then using the AIS downloader extension instead of the multi-object `prefetch` that we used above:
# Step 1. run local http server
# =============================
$ go run htserver.go
# Step 2. download 10 files named shard-{001..010}.tar
# ====================================================
# `hostname` below indicates the hostname or IP address of the machine where
# we are running `go run htserver.go`;
# also note that the destination bucket `ais://abc` will be created iff it doesn't exist
$ ais start download "http://hostname:52062/abc/shard-{001..010}.tar" ais://abc
GUsQcjEPY
Run `ais show job download GUsQcjEPY --progress` to monitor the progress.
# list objects in the bucket we have just created:
$ ais ls ais://abc
NAME SIZE
shard-001.tar 151.13KiB
shard-002.tar 147.98KiB
shard-003.tar 101.45KiB
shard-004.tar 150.37KiB
shard-005.tar 146.00KiB
shard-006.tar 130.70KiB
shard-007.tar 129.04KiB
shard-008.tar 157.53KiB
shard-009.tar 161.32KiB
shard-010.tar 124.11KiB
Other than a different (albeit still arbitrary) listening port and a user-selected destination bucket, the minor differences include explicitly naming the directory from which we want to serve the files. The example also attempts to indicate that if it works for `localhost`, it will work for any valid hostname or IP address, as long as the latter is reachable over HTTP.
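To make that last point slightly more concrete, here is one more (purely illustrative) variation of the same Go server, with the bind address and the served directory parameterized via command-line flags; the flag names are made up for this sketch:

package main

import (
	"flag"
	"log"
	"net/http"
)

func main() {
	// Hypothetical -addr and -dir flags: the same program can then serve
	// localhost-only tests and any routable hostname or IP address alike.
	addr := flag.String("addr", "0.0.0.0:52062", "listening address (host:port)")
	dir := flag.String("dir", "/tmp", "directory to serve")
	flag.Parse()
	log.Printf("serving %s at %s", *dir, *addr)
	log.Fatal(http.ListenAndServe(*addr, http.FileServer(http.Dir(*dir))))
}

With that, the download example above works unchanged: point `ais start download` at whatever hostname or IP address the server is reachable at.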