
Fix dataloader hang at the end of an epoch

Open XiaohanZhangCMU opened this issue 6 months ago • 0 comments

Description of changes:

Very rarely, when num_workers > 1, ready_thread appears to be scheduled with higher priority and progresses much faster than prepare_thread. We do not yet know when or why this happens.

Because ready_thread also calls prepare_shard when a shard is marked as remote, the dataloader still finishes the epoch correctly. However, at the end of the epoch, both ready_thread and the main iterator thread must wait for prepare_thread to finish its remaining iterations, which causes a hang of a few minutes and a drop in GPU utilization.
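The interaction described above can be reproduced in a toy model. The sketch below is not the library's code: `simulate_epoch`, the shard state list, and the per-shard delay are all hypothetical stand-ins, and `prepare_shard` here only flips a flag. It shows a fast ready thread preparing every shard itself while a slow prepare thread is still walking the shard list, so the final join has to wait.

```python
import threading
import time


def simulate_epoch(num_shards=20, prepare_step_s=0.01):
    """Toy model of the race; all names and delays are illustrative assumptions."""
    states = ["remote"] * num_shards   # hypothetical per-shard state
    prepared_by = {}                   # shard id -> thread that prepared it
    lock = threading.Lock()

    def prepare_shard(shard_id, who):
        # Stand-in for the real download: mark the shard local exactly once.
        with lock:
            if states[shard_id] == "remote":
                states[shard_id] = "local"
                prepared_by[shard_id] = who

    def prepare_loop():
        # Slow thread: pays a fixed cost per shard even if the shard is
        # already local by the time it gets there.
        for shard_id in range(num_shards):
            time.sleep(prepare_step_s)
            prepare_shard(shard_id, "prepare")

    def ready_loop():
        # Fast thread: prepares any shard it finds still marked remote.
        for shard_id in range(num_shards):
            prepare_shard(shard_id, "ready")

    prepare_thread = threading.Thread(target=prepare_loop)
    ready_thread = threading.Thread(target=ready_loop)
    prepare_thread.start()
    ready_thread.start()

    ready_thread.join()                # all samples are ready at this point
    wait_start = time.perf_counter()
    prepare_thread.join()              # end of epoch still blocks here
    tail_wait_s = time.perf_counter() - wait_start

    return {
        "all_local": all(state == "local" for state in states),
        "prepared_by_ready": sum(1 for who in prepared_by.values() if who == "ready"),
        "tail_wait_s": tail_wait_s,
    }


if __name__ == "__main__":
    print(simulate_epoch())
```

In this model the epoch's data is fully available as soon as the ready thread finishes, yet the caller is still blocked for roughly `num_shards * prepare_step_s` seconds while the prepare thread completes its loop. That idle tail is the analogue of the end-of-epoch hang reported here.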

Issue #, if available:

We hypothesize that this issue and this issue are related: in both, several users observed a throughput drop at the end of an epoch.

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • [ ] I have read the contributor guidelines
  • [ ] This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • [ ] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • [ ] I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • [ ] I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • [ ] I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • [ ] I ran the tests locally to make sure they pass. (check out testing)
  • [ ] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

XiaohanZhangCMU avatar Aug 02 '24 02:08 XiaohanZhangCMU