petastorm
Reader: shuffle row groups
Codecov Report
Merging #767 (5d18c4b) into master (3f24800) will increase coverage by 0.03%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #767 +/- ##
==========================================
+ Coverage 86.26% 86.30% +0.03%
==========================================
Files 85 85
Lines 5081 5095 +14
Branches 783 786 +3
==========================================
+ Hits 4383 4397 +14
Misses 559 559
Partials 139 139
Impacted Files | Coverage Δ | |
---|---|---|
petastorm/arrow_reader_worker.py | 91.19% <100.00%> (+0.28%) | :arrow_up: |
petastorm/py_dict_reader_worker.py | 95.58% <100.00%> (+0.13%) | :arrow_up: |
petastorm/reader.py | 90.86% <100.00%> (+0.16%) | :arrow_up: |
petastorm/workers_pool/ventilator.py | 93.33% <100.00%> (+0.09%) | :arrow_up: |
Attaching benchmark results from a synthetic dataset with PyTorch:
- Experiment setup: BatchedDataLoader + make_batch_reader(), batch_size=50000, shuffle_buffer_size=1000000, thread_pool, 10 workers.
- Datasets from 100M to 1.6B rows were tested.
- The comparison is between the throughput of shuffling in the dataloader and shuffling in the reader.
Shuffling in the reader is expected to give higher throughput, since multiple workers shuffle in parallel.
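To illustrate the difference between the two shuffle placements, here is a minimal standard-library sketch. It is not petastorm's implementation: the function names (`shuffle_row_group`, `shuffle_in_workers`, `shuffle_in_loader`) and the four 1000-row groups are hypothetical, and a full-list shuffle stands in for the dataloader's bounded shuffle buffer. It only shows the structure (each worker shuffles its own row group independently, versus one consumer shuffling everything) and is not meant to reproduce the throughput numbers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def shuffle_row_group(rows, seed):
    """Shuffle one row group with its own RNG, as a reader worker might."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_in_workers(row_groups, workers=10, seed=0):
    """Shuffle every row group in a worker pool, then concatenate the results."""
    seeds = [seed + i for i in range(len(row_groups))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        shuffled_groups = list(pool.map(shuffle_row_group, row_groups, seeds))
    return [row for group in shuffled_groups for row in group]


def shuffle_in_loader(row_groups, seed=0):
    """Baseline: a single consumer-side shuffle over all rows (simplified)."""
    rows = [row for group in row_groups for row in group]
    random.Random(seed).shuffle(rows)
    return rows


# Four hypothetical row groups of 1000 rows each.
row_groups = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
in_workers = shuffle_in_workers(row_groups)
in_loader = shuffle_in_loader(row_groups)
```

In this sketch the worker-side shuffle only permutes rows within a row group, so the shuffle work is split into independent pieces that can be handed to separate workers; the single consumer-side shuffle has no such split.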
fyi @selitvin