NeMo-Curator
[FEA] Add batched files reading to separate_by_metadata.py
Is your feature request related to a problem? Please describe.
The separate_by_metadata.py script reads all the files at once and distributes them across the Dask workers. With a large enough input, that can lead to out-of-memory (OOM) errors.
Describe the solution you'd like
Read the files in batches, so that only one batch is loaded at a time and the chance of an OOM is reduced.
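To illustrate the idea, here is a minimal sketch of the batching step in plain Python. It is not taken from the script: the helper name `iter_batches` and the `batch_size` parameter are hypothetical, and the sketch only groups file paths without touching Dask or NeMo-Curator.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` paths.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not part of separate_by_metadata.py.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(paths)
    while True:
        # Pull the next `batch_size` paths; an empty list means we are done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder file names, used only to show the grouping.
    files = [f"part_{i}.jsonl" for i in range(5)]
    for batch in iter_batches(files, batch_size=2):
        print(batch)
```

With something like this, the script could read, separate, and write one batch before loading the next, so peak memory would depend on the batch size rather than on the total number of input files.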