Mid Epoch Resumption of Streaming Datasets
Uses a Feistel network to reproducibly and efficiently shuffle streaming datasets while constraining the maximum number of shards that must be present on the device at any given time.
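For reviewers unfamiliar with the technique, here is a minimal sketch of why a Feistel network suits mid-epoch resumption. It is not the code in this PR: the names `feistel_permute` and `shuffled_indices`, the BLAKE2b round function, and the round count are all assumptions made for illustration, and the sketch does not model the shard-residency constraint. The point it shows is that the shuffle is a stateless bijection from position to sample index, so resuming only requires knowing the seed and how many samples were already consumed.

```python
import hashlib


def _round_fn(value: int, round_key: int, half_bits: int) -> int:
    """Hypothetical round function: hash (round_key, value) down to half_bits bits."""
    digest = hashlib.blake2b(f"{round_key}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << half_bits) - 1)


def feistel_permute(index: int, num_samples: int, seed: int, rounds: int = 4) -> int:
    """Map a position in [0, num_samples) to a unique shuffled index in the same range."""
    if not 0 <= index < num_samples:
        raise IndexError(index)
    # Balanced Feistel network over 2 * half_bits bits, a domain of at least num_samples.
    half_bits = max(1, ((num_samples - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    value = index
    while True:
        left, right = value >> half_bits, value & mask
        for r in range(rounds):
            # Each round is invertible, so the whole network is a bijection on the domain.
            left, right = right, left ^ _round_fn(right, seed * rounds + r, half_bits)
        value = (left << half_bits) | right
        # Cycle walking: re-apply the network until the result lands back in range.
        if value < num_samples:
            return value


def shuffled_indices(num_samples: int, seed: int, start: int = 0):
    """Yield the shuffled order, optionally resuming from position `start`."""
    for position in range(start, num_samples):
        yield feistel_permute(position, num_samples, seed)
```

Because no permutation array is materialized, the memory cost is constant in the dataset size. Resuming after 400 samples of a 1000-sample epoch is `shuffled_indices(1000, seed, start=400)`, which yields exactly the tail of the original order. A per-epoch reshuffle could be obtained by folding the epoch number into `seed` (again an assumption, not something this PR specifies).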
Addresses:
- https://mosaicml.atlassian.net/browse/CO-408
- https://mosaicml.atlassian.net/browse/CO-580
Follows up on an issue diagnosed by https://mosaicml.atlassian.net/browse/CO-630
The rest of the PR doesn't pass unit tests yet, so it is not included here. I will mark this as ready for review once the non-skeleton code is merged.
(composer_venv) ~/m/composer ❮❮❮ pytest -s tests/datasets/test_streaming.py ✘ 4
============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.8.9, pytest-7.1.0, pluggy-1.0.0
rootdir: /Users/cress/mosaic/composer, configfile: pyproject.toml
plugins: pytest_codeblocks-0.16.1, timeout-2.1.0, httpserver-1.0.4
collected 138 items / 104 deselected / 34 selected
tests/datasets/test_streaming.py ........................Successfully raised error: [Errno 2] No such file or directory: '/private/var/folders/x9/m4hh28855d37z6g2d3pqr5t80000gn/T/pytest-of-cress/pytest-377/test_reader_download_fail_inde0/remote/index.mds'
.Successfully raised error: [Errno 2] No such file or directory: '/private/var/folders/x9/m4hh28855d37z6g2d3pqr5t80000gn/T/pytest-of-cress/pytest-377/test_reader_download_fail_shar0/local/000001.mds'
....x....
================================================================= 33 passed, 104 deselected, 1 xfailed in 7.70s ==================================================================
(composer_venv) ~/m/composer ❯❯❯
root@interactive-a100-40gb-1-5crl-9cgf5:~/composer# pytest -m daily -s tests/datasets/test_streaming.py
============================================================ test session starts ============================================================
platform linux -- Python 3.9.13, pytest-7.1.0, pluggy-1.0.0
rootdir: /root/composer, configfile: pyproject.toml
plugins: httpserver-1.0.5, pytest_codeblocks-0.16.1
collected 137 items / 65 deselected / 72 selected
tests/datasets/test_streaming.py ..................XXXXXXXXXXXXXXXXXX..................XXXXXXXXXXXXXXXXXX
============================================== 36 passed, 65 deselected, 36 xpassed in 41.16s ===============================================
root@interactive-a100-40gb-1-5crl-9cgf5:~/composer# make test-dist WORLD_SIZE=2 EXTRA_ARGS="-s -m daily tests/datasets/test_streaming.py"
python3 -m composer.cli.launcher -n 2 --master_port 26000 -m pytest -s -m daily tests/datasets/test_streaming.py
============================================================ test session starts ============================================================
platform linux -- Python 3.9.13, pytest-7.1.0, pluggy-1.0.0
rootdir: /root/composer, configfile: pyproject.toml
plugins: httpserver-1.0.5, pytest_codeblocks-0.16.1
collected 137 items / 105 deselected / 32 selected
tests/datasets/test_streaming.py ................................
==================================================== 32 passed, 105 deselected in 25.80s ====================================================
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
root@interactive-a100-40gb-1-5crl-9cgf5:~/composer# pytest -m remote -s tests/datasets/test_streaming_remote.py
============================================================ test session starts ============================================================
platform linux -- Python 3.9.13, pytest-7.1.0, pluggy-1.0.0
rootdir: /root/composer, configfile: pyproject.toml
plugins: httpserver-1.0.5, pytest_codeblocks-0.16.1
collected 12 items / 1 deselected / 11 selected
tests/datasets/test_streaming_remote.py xBuilt dataset
samples read: 1000
samples read: 2000
build_dur=1.26s, iter_dur=5.90, samples_per_sec=338.77
XxBuilt dataset
samples read: 1000
samples read: 2000
samples read: 3000
samples read: 4000
samples read: 5000
samples read: 6000
samples read: 7000
samples read: 8000
samples read: 9000
samples read: 10000
samples read: 11000
samples read: 12000
samples read: 13000
samples read: 14000
samples read: 15000
samples read: 16000
samples read: 17000
samples read: 18000
samples read: 19000
samples read: 20000
samples read: 21000
samples read: 22000
samples read: 23000
samples read: 24000
samples read: 25000
samples read: 26000
samples read: 27000
samples read: 28000
samples read: 29000
samples read: 30000
samples read: 31000
samples read: 32000
samples read: 33000
samples read: 34000
samples read: 35000
samples read: 36000
samples read: 37000
samples read: 38000
samples read: 39000
samples read: 40000
samples read: 41000
samples read: 42000
samples read: 43000
samples read: 44000
samples read: 45000
samples read: 46000
samples read: 47000
samples read: 48000
samples read: 49000
samples read: 50000
build_dur=1.17s, iter_dur=199.81, samples_per_sec=250.24
XBuilt dataset
samples read: 1000
samples read: 2000
samples read: 3000
samples read: 4000
build_dur=1.17s, iter_dur=53.67, samples_per_sec=92.27
XBuilt dataset
samples read: 1000
samples read: 2000
samples read: 3000
samples read: 4000
samples read: 5000
samples read: 6000
samples read: 7000
samples read: 8000
samples read: 9000
samples read: 10000
build_dur=1.06s, iter_dur=9.20, samples_per_sec=1086.61
XBuilt dataset
ds_build_dur=0.93s, loader_build_dur=0.00s
samples read: 104
samples read: 208
samples read: 312
samples read: 416
samples read: 520
samples read: 624
samples read: 728
samples read: 832
samples read: 936
samples read: 1040
samples read: 1144
samples read: 1248
samples read: 1352
samples read: 1456
samples read: 1560
samples read: 1664
samples read: 1768
samples read: 1872
samples read: 1976
Epoch 0: epoch_dur=3.88s, samples_per_sec=515.46
samples read: 104
samples read: 208
samples read: 312
samples read: 416
samples read: 520
samples read: 624
samples read: 728
samples read: 832
samples read: 936
samples read: 1040
samples read: 1144
samples read: 1248
samples read: 1352
samples read: 1456
samples read: 1560
samples read: 1664
samples read: 1768
samples read: 1872
samples read: 1976
Epoch 1: epoch_dur=1.59s, samples_per_sec=1257.52
samples read: 104
samples read: 208
samples read: 312
samples read: 416
samples read: 520
samples read: 624
samples read: 728
samples read: 832
samples read: 936
samples read: 1040
samples read: 1144
samples read: 1248
samples read: 1352
samples read: 1456
samples read: 1560
samples read: 1664
samples read: 1768
samples read: 1872
samples read: 1976
Epoch 2: epoch_dur=1.58s, samples_per_sec=1265.63
XBuilt dataset
ds_build_dur=0.94s, loader_build_dur=0.00s
samples read: 2504
samples read: 5008
samples read: 7512
samples read: 10016
samples read: 12520
samples read: 15024
samples read: 17528
samples read: 20032
samples read: 22536
samples read: 25040
samples read: 27544
samples read: 30048
samples read: 32552
samples read: 35056
samples read: 37560
samples read: 40064
samples read: 42568
samples read: 45072
samples read: 47576
Epoch 0: epoch_dur=52.45s, samples_per_sec=953.24
samples read: 2504
samples read: 5008
samples read: 7512
samples read: 10016
samples read: 12520
samples read: 15024
samples read: 17528
samples read: 20032
samples read: 22536
samples read: 25040
samples read: 27544
samples read: 30048
samples read: 32552
samples read: 35056
samples read: 37560
samples read: 40064
samples read: 42568
samples read: 45072
samples read: 47576
Epoch 1: epoch_dur=17.06s, samples_per_sec=2931.14
samples read: 2504
samples read: 5008
samples read: 7512
samples read: 10016
samples read: 12520
samples read: 15024
samples read: 17528
samples read: 20032
samples read: 22536
samples read: 25040
samples read: 27544
samples read: 30048
samples read: 32552
samples read: 35056
samples read: 37560
samples read: 40064
samples read: 42568
samples read: 45072
samples read: 47576
Epoch 2: epoch_dur=17.28s, samples_per_sec=2893.10
XBuilt dataset
ds_build_dur=1.05s, loader_build_dur=0.00s
samples read: 248
samples read: 496
samples read: 744
samples read: 992
samples read: 1240
samples read: 1488
samples read: 1736
samples read: 1984
samples read: 2232
samples read: 2480
samples read: 2728
samples read: 2976
samples read: 3224
samples read: 3472
samples read: 3720
samples read: 3968
samples read: 4216
samples read: 4464
samples read: 4712
Epoch 0: epoch_dur=6.80s, samples_per_sec=727.81
samples read: 248
samples read: 496
samples read: 744
samples read: 992
samples read: 1240
samples read: 1488
samples read: 1736
samples read: 1984
samples read: 2232
samples read: 2480
samples read: 2728
samples read: 2976
samples read: 3224
samples read: 3472
samples read: 3720
samples read: 3968
samples read: 4216
samples read: 4464
samples read: 4712
Epoch 1: epoch_dur=2.86s, samples_per_sec=1733.51
samples read: 248
samples read: 496
samples read: 744
samples read: 992
samples read: 1240
samples read: 1488
samples read: 1736
samples read: 1984
samples read: 2232
samples read: 2480
samples read: 2728
samples read: 2976
samples read: 3224
samples read: 3472
samples read: 3720
samples read: 3968
samples read: 4216
samples read: 4464
samples read: 4712
Epoch 2: epoch_dur=2.78s, samples_per_sec=1784.27
XBuilt dataset
ds_build_dur=0.99s, loader_build_dur=0.00s
samples read: 504
samples read: 1008
samples read: 1512
samples read: 2016
samples read: 2520
samples read: 3024
samples read: 3528
samples read: 4032
samples read: 4536
samples read: 5040
samples read: 5544
samples read: 6048
samples read: 6552
samples read: 7056
samples read: 7560
samples read: 8064
samples read: 8568
samples read: 9072
samples read: 9576
Epoch 0: epoch_dur=3.09s, samples_per_sec=3234.36
samples read: 504
samples read: 1008
samples read: 1512
samples read: 2016
samples read: 2520
samples read: 3024
samples read: 3528
samples read: 4032
samples read: 4536
samples read: 5040
samples read: 5544
samples read: 6048
samples read: 6552
samples read: 7056
samples read: 7560
samples read: 8064
samples read: 8568
samples read: 9072
samples read: 9576
Epoch 1: epoch_dur=0.81s, samples_per_sec=12369.47
samples read: 504
samples read: 1008
samples read: 1512
samples read: 2016
samples read: 2520
samples read: 3024
samples read: 3528
samples read: 4032
samples read: 4536
samples read: 5040
samples read: 5544
samples read: 6048
samples read: 6552
samples read: 7056
samples read: 7560
samples read: 8064
samples read: 8568
samples read: 9072
samples read: 9576
Epoch 2: epoch_dur=0.82s, samples_per_sec=12140.28
XBuilt dataset
ds_build_dur=5.94s, loader_build_dur=0.00s
samples read: 18232
samples read: 36464
samples read: 54696
samples read: 72928
samples read: 91160
samples read: 109392
samples read: 127624
samples read: 145856
samples read: 164088
samples read: 182320
samples read: 200552
samples read: 218784
samples read: 237016
samples read: 255248
samples read: 273480
samples read: 291712
samples read: 309944
samples read: 328176
samples read: 346408
Epoch 0: epoch_dur=73.36s, samples_per_sec=4969.93
samples read: 18232
samples read: 36464
samples read: 54696
samples read: 72928
samples read: 91160
samples read: 109392
samples read: 127624
samples read: 145856
samples read: 164088
samples read: 182320
samples read: 200552
samples read: 218784
samples read: 237016
samples read: 255248
samples read: 273480
samples read: 291712
samples read: 309944
samples read: 328176
samples read: 346408
Epoch 1: epoch_dur=70.34s, samples_per_sec=5183.39
samples read: 18232
samples read: 36464
samples read: 54696
samples read: 72928
samples read: 91160
samples read: 109392
samples read: 127624
samples read: 145856
samples read: 164088
samples read: 182320
samples read: 200552
samples read: 218784
samples read: 237016
samples read: 255248
samples read: 273480
samples read: 291712
samples read: 309944
samples read: 328176
samples read: 346408
Epoch 2: epoch_dur=69.69s, samples_per_sec=5231.55
X
========================================== 1 deselected, 2 xfailed, 9 xpassed in 608.44s (0:10:08) ==========================================
root@interactive-a100-40gb-1-5crl-9cgf5:~/composer#
Closing this out since we are no longer adding new features to V1; this will be supported in V2.