
Reader: shuffle row groups

Open chongxiaoc opened this issue 2 years ago • 1 comments

chongxiaoc avatar Aug 09 '22 23:08 chongxiaoc

Codecov Report

Merging #767 (5d18c4b) into master (3f24800) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #767      +/-   ##
==========================================
+ Coverage   86.26%   86.30%   +0.03%     
==========================================
  Files          85       85              
  Lines        5081     5095      +14     
  Branches      783      786       +3     
==========================================
+ Hits         4383     4397      +14     
  Misses        559      559              
  Partials      139      139              
Impacted Files                            Coverage Δ
petastorm/arrow_reader_worker.py          91.19% <100.00%> (+0.28%) ↑
petastorm/py_dict_reader_worker.py        95.58% <100.00%> (+0.13%) ↑
petastorm/reader.py                       90.86% <100.00%> (+0.16%) ↑
petastorm/workers_pool/ventilator.py      93.33% <100.00%> (+0.09%) ↑


codecov[bot] avatar Aug 10 '22 07:08 codecov[bot]

Attaching some benchmark results on a synthetic dataset with PyTorch:

  • Experiment setup: BatchedDataLoader + make_batch_reader(), batch_size=50000, shuffle_buffer_size=1000000, thread pool, 10 workers.
  • Datasets from 100M rows to 1.6B rows were tested.
  • The comparison is throughput with shuffling in the dataloader versus shuffling in the reader.

[Figure: Synthetic Dataset PyTorch Throughput]

Shuffling in the reader is expected to give higher throughput, since multiple workers shuffle in parallel instead of a single dataloader doing all the shuffling.
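For anyone skimming, here is a rough, standard-library-only sketch of the two strategies being compared. It is not the petastorm implementation: the helper names (`make_row_groups`, `shuffle_in_loader`, `shuffle_in_workers`) and the per-group seeding scheme are made up for illustration, and "row groups" are just lists of integers. It only shows where the shuffle work happens in each case.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_row_groups(num_groups, rows_per_group):
    """Build toy 'row groups': consecutive lists of integer row ids (hypothetical stand-in)."""
    return [
        list(range(g * rows_per_group, (g + 1) * rows_per_group))
        for g in range(num_groups)
    ]


def shuffle_in_loader(row_groups, seed):
    """Single consumer: gather all rows first, then shuffle them in one place."""
    rows = [row for group in row_groups for row in group]
    random.Random(seed).shuffle(rows)
    return rows


def _shuffle_one_group(task):
    """Worker task: shuffle a copy of one row group with its own seed."""
    group, seed = task
    shuffled = list(group)  # copy so the input row group is not mutated
    random.Random(seed).shuffle(shuffled)
    return shuffled


def shuffle_in_workers(row_groups, seed, workers=10):
    """Visit row groups in random order; each worker shuffles the rows of its own group."""
    order = list(range(len(row_groups)))
    random.Random(seed).shuffle(order)
    # Illustrative seeding scheme: derive a per-group seed from the base seed.
    tasks = [(row_groups[i], seed + i) for i in order]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        shuffled_groups = list(pool.map(_shuffle_one_group, tasks))
    return [row for group in shuffled_groups for row in group]


if __name__ == "__main__":
    groups = make_row_groups(num_groups=4, rows_per_group=5)
    print(shuffle_in_loader(groups, seed=0))
    print(shuffle_in_workers(groups, seed=0, workers=2))
```

Note that in this toy version rows only move within their own row group, so the mixing is coarser than a global shuffle buffer, and pure-Python threads will not show a real speedup because of the GIL. The point is only that the per-group shuffles are independent and can be spread across workers.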

fyi @selitvin

chongxiaoc avatar Aug 19 '22 06:08 chongxiaoc