
[DRAFT] Trying dask_cudf's read_json / read_parquet

Open praateekmahajan opened this issue 4 months ago • 0 comments

Description

Reading 6000 files of ~25 MB each, i.e. ~145 GB in total, across 8 GPUs.
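
To put the benchmark size in perspective, here is a rough sizing sketch. It assumes decimal units (1 GB = 1000 MB) and that a partition is built by packing whole files until the 2gb partition_size used in the runs is reached; neither assumption is stated in this PR, and `estimate_layout` is a hypothetical helper, not part of NeMo-Curator or dask.

```python
def estimate_layout(total_gb, num_files, num_gpus, partition_size_gb):
    """Estimate how the dataset splits into partitions and across GPUs.

    Assumes 1 GB = 1000 MB and that whole files are packed into a
    partition until partition_size_gb would be exceeded (an assumption,
    not the documented behavior of any reader).
    """
    avg_file_mb = total_gb * 1000 / num_files
    # Number of average-sized files that fit in one partition.
    files_per_partition = int(partition_size_gb * 1000 // avg_file_mb)
    # Ceiling division: the last partition may be only partly filled.
    num_partitions = -(-num_files // files_per_partition)
    return {
        "avg_file_mb": avg_file_mb,
        "files_per_partition": files_per_partition,
        "num_partitions": num_partitions,
        "partitions_per_gpu": num_partitions / num_gpus,
        "gb_per_gpu": total_gb / num_gpus,
    }


if __name__ == "__main__":
    # Figures taken from the description: ~145 GB, 6000 files, 8 GPUs, 2gb partitions.
    for key, value in estimate_layout(145, 6000, 8, 2).items():
        print(f"{key}: {value}")
```

Under these assumptions the figures work out to about 82 files per partition, 74 partitions in total, and roughly 18 GB of input per GPU.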

| add_filename | partition_size | input_meta | Using dask.read_json #285 | Providing meta in dask.from_map #291 |
|---|---|---|---|---|
| False | 2gb | Specified | 24.9 s ± 330 ms | 25.9 s ± 520 ms |
| False | 2gb | None | 24.9 s ± 470 ms | OOM |
| True | 2gb | Specified | 55 s ± 177 ms | 53.2 s ± 350 ms per loop |
| True | 2gb | None | 54.8 s ± 248 ms | 64 s ± 289 ms per loop |

| Using dask.read_json #285 | Providing meta in dask.from_map #291 |
|---|---|
| image | image |
| The first two runs use add_filename=False and the latter two use add_filename=True, where we see lower utilization. | The first run uses add_filename=False and the latter runs use add_filename=True, where we see lower utilization. |

Usage

# Add snippet demonstrating usage

Checklist

  • [ ] I am familiar with the Contributing Guide.
  • [ ] New or Existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

praateekmahajan, Oct 08 '24 21:10