streaming
streaming copied to clipboard
parallel merge index
Description of changes:
Make merge_index utility run in parallel with multiprocessing. Note the normal use case for merge index happens after mds shards are written to a number of partition folders (remote or local), and one wants to merge the index files of those folders into one merged index file. Depending on the number of cores available at the driver node, downloading and merging use all the available cores.
profiling with 40 cores. Explanation of the profile table:
- with total number of mds files = 131072, and numbr of sreams = 8192, distribution = uniform, it means the root folder has 8192 subfolders, each folder has one index.json, so 8192 index.json to be merged. Each stream has 131072/8192 shards in it, so all index.json files have the same size.
- with the same numbers as above but distribution = expoential, the ony difference is that the size of index.json files are not uniformly distributed, but skewed (exponentially distributed). This represents the cases where some streams are particularly heavier than some others.
Downloading
total number of mds files | number of streams | mds file distribution | Serial Download (s) | . Parallel Download |
---|---|---|---|---|
131072 | 8192 | uniform | 317 | 45 |
1048576 | 8192 | uniform | x | 9 |
8388608 | 8192 | uniform | 314 | 48 |
16384 | 256 | exponential | 10 | 2.2 |
131072 | 8192 | exponential | 322 | 49 |
2097152 | 8192 | exponential | 316 | 50 |
8388608 | 8192 | exponential | 315 | 50 |
Merging
total number of mds files | number of streams | mds file distribution | Serial Merge (s) | . Parallel Merge |
---|---|---|---|---|
131072 | 8192 | uniform | 1 | 1 |
1048576 | 8192 | uniform | 1 | 1 |
8388608 | 8192 | uniform | 21 | 20 |
16384 | 256 | exponential | 1 | 1 |
131072 | 8192 | exponential | 1 | 1 |
2097152 | 8192 | exponential | 5 | 5 |
8388608 | 8192 | exponential | 20 | 19 |
Issue #, if available:
Merge Checklist:
Put an x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
- [ ] I have read the contributor guidelines
- [ ] This is a documentation change or typo fix. If so, skip the rest of this checklist.
- [ ] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
- [ ] I have updated any necessary documentation, including README and API docs (if appropriate).
Tests
- [ ] I ran
pre-commit
on my change. (check out thepre-commit
section of prerequisites) - [ ] I have added tests that prove my fix is effective or that my feature works (if appropriate).
- [ ] I ran the tests locally to make sure it pass. (check out testing)
- [ ] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.
@knighton I updated the description to include some of the profiling results. PTA~
@XiaohanZhangCMU is this ready for another round of reviewing? would be good to get it in
@XiaohanZhangCMU what remaining changes do we need here?