Loading the entire dataset into memory
First of all, thank you so much for the wonderful library. I am working on a Slurm cluster where disk reads are very slow, so I wanted to load the entire dataset (around 280 GB) into RAM. The cluster has around 400 GB of RAM, so this should work without issue. However, since multi-process execution gives each process its own local copy, I would run out of memory. Since the dataset is only used for reading, is there a way to create a single copy in memory that can be shared between all the processes?
Thanks.
You can use the dispatch_batches=True option and load your real dataset only in process 0 (loading a placeholder of the same length, with no real samples, in the other processes). Accelerate will then build the batches in process 0 only and dispatch them to all the other workers.
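For reference, here is a minimal sketch of that approach. DummyDataset, DATASET_LENGTH, and load_my_real_dataset are hypothetical names used for illustration, and on recent Accelerate versions dispatch_batches is passed via DataLoaderConfiguration rather than directly to Accelerator:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from accelerate import Accelerator


class DummyDataset(Dataset):
    """Placeholder with the same length as the real dataset but no real samples."""

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Never actually consumed: with dispatch_batches=True, the batches
        # seen by this process are built in process 0 and dispatched here.
        return torch.zeros(1)


accelerator = Accelerator(dispatch_batches=True)

DATASET_LENGTH = 1_000_000  # must match the length of the real dataset

if accelerator.is_main_process:
    dataset = load_my_real_dataset()  # hypothetical loader for the 280 GB dataset
else:
    dataset = DummyDataset(DATASET_LENGTH)

dataloader = DataLoader(dataset, batch_size=32)
# After prepare(), only process 0 iterates the real dataset; it slices each
# batch and sends the shards to the other processes.
dataloader = accelerator.prepare(dataloader)
```

This keeps a single in-memory copy of the dataset, at the cost of the inter-process communication needed to dispatch each batch from process 0.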