NeMo-Curator
[DRAFT] Trying dask_cudf's read_json / read_parquet
Description
Reading 6000 files of ~25 MB each (i.e., ~145 GB in total) across 8 GPUs:
| add_filename | partition_size | input_meta | Using dask.read_json #285 | Providing meta in dask.from_map #291 |
|---|---|---|---|---|
| False | 2gb | Specified | 24.9 s ± 330 ms | 25.9 s ± 520 ms |
| False | 2gb | None | 24.9 s ± 470 ms | OOM |
| True | 2gb | Specified | 55 s ± 177 ms | 53.2 s ± 350 ms |
| True | 2gb | None | 54.8 s ± 248 ms | 64 s ± 289 ms |
| Using dask.read_json #285 | Providing meta in dask.from_map #291 |
|---|---|
| The first two runs use add_filename=False; the latter two use add_filename=True, where we see lower utilization | The first run uses add_filename=False; the latter runs use add_filename=True, where we see lower utilization |
Usage
# Add snippet demonstrating usage
Checklist
- [ ] I am familiar with the Contributing Guide.
- [ ] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.