NeMo-Curator
[DRAFT] Trying dask_cudf's read_json / read_parquet
Description
Reading 6000 files of ~25 MB each (i.e., ~145 GB in total) across 8 GPUs.
add_filename | partition_size | input_meta | Using dask.read_json #285 | Providing meta in dask.from_map #291 |
---|---|---|---|---|
False | 2gb | Specified | 24.9 s ± 330 ms | 25.9 s ± 520 ms |
False | 2gb | None | 24.9 s ± 470 ms | OOM |
True | 2gb | Specified | 55 s ± 177 ms | 53.2 s ± 350 ms |
True | 2gb | None | 54.8 s ± 248 ms | 64 s ± 289 ms |
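
To put the timings in context, here is a rough back-of-the-envelope calculation using only the figures quoted above (the rows with `input_meta` specified). It assumes 1 GB = 1024 MB and ignores file boundaries when estimating the partition count, so the real number of partitions may differ slightly. The variable names are illustrative and do not come from either PR.

```python
import math

# Figures quoted in the description and the timing table.
N_FILES = 6000
FILE_SIZE_MB = 25          # approximate size of each file
N_GPUS = 8
PARTITION_SIZE_GB = 2

# Total input size, assuming 1 GB = 1024 MB.
total_gb = N_FILES * FILE_SIZE_MB / 1024

# Rough partition count: ignores that whole files are grouped into partitions.
n_partitions = math.ceil(total_gb / PARTITION_SIZE_GB)
partitions_per_gpu = n_partitions / N_GPUS

# Mean wall-clock times in seconds (input_meta = Specified rows).
timings_s = {
    "read_json": {"add_filename=False": 24.9, "add_filename=True": 55.0},
    "from_map": {"add_filename=False": 25.9, "add_filename=True": 53.2},
}

# Slowdown factor caused by add_filename=True, and read throughput without it.
slowdown = {
    name: t["add_filename=True"] / t["add_filename=False"]
    for name, t in timings_s.items()
}
throughput_gb_s = {
    name: total_gb / t["add_filename=False"] for name, t in timings_s.items()
}

print(f"total size: {total_gb:.1f} GB in about {n_partitions} partitions")
print(f"about {partitions_per_gpu:.2f} partitions per GPU")
for name in timings_s:
    print(f"{name}: {slowdown[name]:.2f}x slower with add_filename, "
          f"{throughput_gb_s[name]:.2f} GB/s without it")
```

Under these assumptions, both approaches take roughly twice as long once `add_filename=True` is set, which is the main cost visible in the table.
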
Using dask.read_json #285 | Providing meta in dask.from_map #291 |
---|---|
The first two runs use add_filename=False and the latter two use add_filename=True, where we see lower utilization | The first run uses add_filename=False and the latter runs use add_filename=True, where we see lower utilization |
Usage
# Add snippet demonstrating usage
Checklist
- [ ] I am familiar with the Contributing Guide.
- [ ] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.