[FEA] Add batched files reading to separate_by_metadata.py

Open miguelusque opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe. The separate_by_metadata.py script reads all the input files at once and distributes them across the Dask workers. With a large enough dataset, that could lead to out-of-memory (OOM) errors.

Describe the solution you'd like Read the files in batches, so that only part of the dataset is in memory at any time and an OOM is less likely.
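To make the request concrete, here is a minimal sketch of the batching idea in plain Python. It is not the script's actual code: `batched`, `process_in_batches`, `batch_size` and the `process_batch` callback are hypothetical names used only for illustration, and the callback stands in for whatever the script does to read, split and write one group of files.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while True:
        # Pull the next `batch_size` items; the final batch may be shorter.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(
    paths: Iterable[str],
    batch_size: int,
    process_batch: Callable[[List[str]], None],
) -> int:
    """Run `process_batch` on one batch at a time and return the batch count.

    `process_batch` is a hypothetical hook: in the real script it would be
    the place where one batch of files is read, separated by metadata and
    written out before the next batch is started.
    """
    count = 0
    for batch in batched(paths, batch_size):
        process_batch(batch)
        count += 1
    return count
```

With this shape, peak memory would depend on `batch_size` rather than on the total number of input files, provided each batch is fully written out before the next one is read. Output for the same metadata value would then have to be appended across batches instead of written once.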

miguelusque avatar May 06 '24 21:05 miguelusque