streaming
Fix dataloader hang at the end of an epoch
Description of changes:
Very rarely, ready_thread appears to be given a higher priority when num_workers > 1: we observe ready_thread progressing much faster than prepare_thread. It is not known when or why this happens.
Since ready_thread also calls prepare_shard when a shard is marked as remote, the dataloader still makes progress and finishes the epoch. However, at the end of the epoch, both ready_thread and the main iter thread have to wait for prepare_thread to finish its remaining iterations, which leads to a hang of a few minutes and a drop in GPU utilization.
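To make the failure mode concrete, here is a toy model of the race described above. It is not the actual streaming code: the function name `remaining_prepare_steps` and the `ready_speed` / `prepare_speed` parameters are hypothetical, and the two threads are reduced to two cursors that advance over the same sample list at different rates.

```python
def remaining_prepare_steps(num_samples: int, ready_speed: int, prepare_speed: int) -> int:
    """Toy model: how many samples the slow cursor still has to walk
    when the fast cursor reaches the end of the epoch.

    ready_speed / prepare_speed are hypothetical "samples per tick" rates
    standing in for ready_thread and prepare_thread.
    """
    ready_pos = 0
    prepare_pos = 0
    # Advance both cursors one tick at a time until the fast one finishes.
    while ready_pos < num_samples:
        ready_pos = min(num_samples, ready_pos + ready_speed)
        prepare_pos = min(num_samples, prepare_pos + prepare_speed)
    # Whatever is left here is what the epoch end has to wait for.
    return num_samples - prepare_pos


if __name__ == "__main__":
    # Fast ready cursor, slow prepare cursor: a long tail at epoch end.
    print(remaining_prepare_steps(1000, ready_speed=10, prepare_speed=1))
    # Equal speeds: nothing left to wait for.
    print(remaining_prepare_steps(1000, ready_speed=1, prepare_speed=1))
```

In this simplified picture, the further ready_thread runs ahead, the longer the tail that the main iter thread waits on at the end of the epoch, even if every shard has already been prepared along the way.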
Issue #, if available:
We hypothesize that this issue and this issue are related as well: in both, several users observed a throughput drop at the end of an epoch.
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
- [ ] I have read the contributor guidelines
- [ ] This is a documentation change or typo fix. If so, skip the rest of this checklist.
- [ ] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
- [ ] I have updated any necessary documentation, including README and API docs (if appropriate).
Tests
- [ ] I ran `pre-commit` on my change. (check out the `pre-commit` section of prerequisites)
- [ ] I have added tests that prove my fix is effective or that my feature works (if appropriate).
- [ ] I ran the tests locally to make sure they pass. (check out testing)
- [ ] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.