NVTabular
[DOC] How to partition a dataset with nvt.Dataset & workflow so that each parquet contains the same number of samples?
Report needed documentation
I followed the example to preprocess the Criteo dataset with 8 workers, and had each worker generate 8 parquet files by setting out_files_per_proc=out_files_per_proc in the workflow.
However, I found that each of the resulting 64 files has a different number of samples, so TorchAsyncItr yields a different number of batches per worker in a multiprocessing setting, which my code cannot handle properly.
I would like to know how to make TorchAsyncItr generate an equal number of batches across workers in multiprocessing.
Describe the documentation you'd like
A Jupyter notebook demonstrating how to achieve this would be great.
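For reference, the arithmetic behind "equal-size partitions" is simple: distribute the remainder of total_rows / n_files over the first few partitions so no two files differ by more than one row. The sketch below is plain Python (not NVTabular API; `even_partition_sizes` is a hypothetical helper name) illustrating the target layout one would want the workflow's output, or a post-processing repartition step, to produce.

```python
def even_partition_sizes(total_rows: int, n_files: int) -> list[int]:
    """Split total_rows into n_files sizes that differ by at most 1 row."""
    base, rem = divmod(total_rows, n_files)
    # The first `rem` partitions each take one extra row.
    return [base + 1 if i < rem else base for i in range(n_files)]

sizes = even_partition_sizes(1_000_003, 64)
print(len(sizes), min(sizes), max(sizes), sum(sizes))
# 64 files, sizes differ by at most 1, and all rows are accounted for
```

With such boundaries computed, the dataset could in principle be re-sliced and rewritten (for example via a Dask dataframe repartition before writing parquet), though whether that fits cleanly into the NVTabular workflow is exactly what the requested documentation should cover.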