Introduce a Python client to communicate with the mount
Tell us more about this new feature.
Hey there,
After running some benchmarks with mountpoint-s3, I came to the conclusion that the current offering isn't suitable for large-scale deep learning training.
Unfortunately, this isn't due to the mount itself but to how deep learning practitioners write their data loading and processing code, in particular the way DataLoader handles loading files.
The mount is particularly interesting when you fetch files in parallel, but most code reads files sequentially and then processes them. Even with multiple processes, this quickly becomes a bottleneck because I/O is a blocking operation.
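To make the difference concrete, here is a minimal sketch of the two access patterns. The helper names (`read_file`, `read_sequential`, `read_concurrent`) and the `max_in_flight` default are made up for illustration; none of this is mountpoint-s3 or PyTorch API.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Blocking read of one file; on a network mount each call pays a full round trip."""
    with open(path, "rb") as f:
        return f.read()


def read_sequential(paths, read=read_file):
    # One request in flight at a time: total time is the sum of all latencies.
    return [read(p) for p in paths]


def read_concurrent(paths, read=read_file, max_in_flight=16):
    # Several requests in flight at once; threads release the GIL while
    # blocked on I/O, so latencies overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(read, paths))  # map() preserves input order
```

Both functions return the same bytes in the same order; only the number of outstanding reads differs, which is what matters on a high-latency backend.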
Here are some numbers measured on an m5n.8xlarge, processing the 1.2M ImageNet images with a simple PyTorch DataLoader that reads the files and processes them. I am using 32 workers with a batch size of 256.
| Data Locality | Time Taken |
|---|---|
| local | 10:25 |
| mountpoint-s3 | 2:42:45 |
This is roughly 15x slower, because each worker reads its files sequentially. Moving to async I/O reads doesn't help much either.
However, I implemented a slightly different DataLoader in which the order of the files is known ahead of time, so side workers can prefetch them on the fly directly from S3. This brought the time down from 2:42:45 to 15 min, including 5 min of listing overhead.
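The idea can be sketched roughly as follows. This is not the actual implementation from my benchmark, just an illustration of ordered prefetching with a bounded lookahead window; `prefetching_reader`, `fetch`, `lookahead` and `max_workers` are hypothetical names, and `fetch` stands in for any blocking call that returns the bytes for a key (for example an S3 GET).

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the key stream


def prefetching_reader(keys, fetch, lookahead=64, max_workers=16):
    """Yield (key, data) in the given order while fetching ahead in the background.

    `fetch` is any blocking callable key -> data (an assumption of this sketch).
    """
    keys = iter(keys)
    pending = deque()  # (key, future) pairs, oldest first
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fill the lookahead window before yielding anything.
        for key in keys:
            pending.append((key, pool.submit(fetch, key)))
            if len(pending) >= lookahead:
                break
        while pending:
            key, future = pending.popleft()
            data = future.result()  # blocks only if this item is not ready yet
            # Top the window back up so fetching continues while the
            # consumer processes the item we are about to yield.
            nxt = next(keys, _DONE)
            if nxt is not _DONE:
                pending.append((nxt, pool.submit(fetch, nxt)))
            yield key, data
```

Because the order is known up front, the consumer still sees files in sequence, but up to `lookahead` reads are already in flight by the time it asks for the next one.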
Ideally, I would like a Python client that communicates with the mount to tell it which files to prefetch, without opening them (opening is a blocking operation in Python).
from mountpoints3 import Client
client = Client(...)
futures = client.prefetch_files([..., ..., ..., ...])
Interesting proposal. Linking to #255, as it would likely have to be built on top of those changes.