pytorch-lightning
Add ColossalAI strategy
What does this PR do?
Fixes https://github.com/hpcaitech/ColossalAI/issues/1330 Fixes https://github.com/Lightning-AI/lightning/issues/12733
Add ColossalAI strategy which supports ZeRO-DP with chunk-based memory management.
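For context, a hedged usage sketch of what this PR enables. The registry name "colossalai" and the precision setting are assumptions based on how other Lightning strategies are registered, not something this thread confirms; the snippet is guarded so it degrades gracefully where the libraries are absent.

```python
# Hedged sketch: selecting the new strategy by its (assumed) registry name.
# Wrapped in try/except so the sketch runs even without pytorch_lightning,
# colossalai, or a GPU installed.
try:
    from pytorch_lightning import Trainer

    trainer = Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,            # ZeRO-DP strategies typically train in fp16
        strategy="colossalai",   # assumed registry name for ColossalAIStrategy
    )
    configured = True
except Exception:  # pytorch_lightning/colossalai missing, or no GPU
    trainer = None
    configured = False

print("strategy configured:", configured)
```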
Does your PR introduce any breaking changes? If yes, please list them.
No.
Before submitting
- [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Did you make sure your PR does only one thing, instead of bundling different changes together?
- [ ] Did you make sure to update the documentation with your changes? (if necessary)
- [ ] Did you write any new necessary tests? (not for typos and docs)
- [ ] Did you verify new and existing tests pass locally with your changes?
- [x] Did you list all the breaking changes introduced by this pull request?
- [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
PR review
Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
- [ ] Is this pull request ready for review? (if not, please submit in draft mode)
- [ ] Check that all items from Before submitting are resolved
- [ ] Make sure the title is self-explanatory and the description concisely explains the PR
- [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified
Did you have fun?
Make sure you had fun coding 🙃
hey @ver217 @1SAA Great work here so far. I think we are close. Can you fix the missing imports? Then we can check it on the CI. Currently there's an error: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=98164&view=logs&j=3f274fac-2e11-54ca-487e-194c91f3ae9f&t=fd22e4d0-068c-5651-f467-789f79e7bb7b
you can try running some tests locally to fix the import issues.
I don't have edit access here, so I can't push any commits.
also, do you plan to add the docs in this PR or in a follow-up?
Hi @rohitgr7
Thank you for your review. I've fixed the import issue. As for the docs, we've written the docstring in ColossalAIStrategy. We aren't sure what else is needed for the docs. Could you tell us what else we should do?
Hi @rohitgr7
After our discussion, we decided to add the docs in another PR. Could we just merge this PR first?
Okay, it looks like the colossalai tests are not running and are skipped, since colossalai isn't being installed on the CI. The installation command is: pip install colossalai==0.1.10+torch1.12cu11.3
@1SAA can you add this to https://github.com/Lightning-AI/lightning/blob/master/requirements/pytorch/strategies.txt?
also I checked locally and tests are failing.
PL_RUN_STANDALONE_TESTS=1 pytest strategies/test_colossalai_strategy.py -v
one of the errors:
ImportError: Please install colossalai from source code to use HybridAdam
Looks like HybridAdam is not released yet.
Hi @rohitgr7
The package link is added, and HybridAdam can be imported after installation. But I still have a problem with my local tests. All my tests pass "standalone": here, I test a single test function by setting the standalone parameter to False and PL_RUN_STANDALONE_TESTS to 0. However, local tests can't run properly when PL_RUN_STANDALONE_TESTS=1. I tried to run the file test_ddp.py and it ran forever. It seems the tests do not really run standalone; some deadlock has likely occurred. Is it normal to have deadlocks in local tests, and how does the real testing platform handle standalone tests?
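The gating described above can be pictured with a minimal sketch of how an environment-variable switch for standalone tests typically works. The helper names below are hypothetical (Lightning's real logic lives in its test utilities); only the env var name PL_RUN_STANDALONE_TESTS is taken from this thread. The standalone runner is expected to launch each such test in its own process, which is why running a whole file in one process can deadlock on shared GPU state.

```python
# Minimal sketch of an env-var gate for standalone tests.
# standalone_tests_enabled/should_skip are hypothetical helper names.
import os

def standalone_tests_enabled() -> bool:
    """True when the runner explicitly asked for standalone tests."""
    return os.getenv("PL_RUN_STANDALONE_TESTS", "0") == "1"

def should_skip(marked_standalone: bool) -> bool:
    """A standalone-marked test is skipped in a normal run; in a
    standalone run it is expected to execute in its own process."""
    return marked_standalone and not standalone_tests_enabled()

os.environ["PL_RUN_STANDALONE_TESTS"] = "0"
print(should_skip(True))   # True: skipped in a normal run
os.environ["PL_RUN_STANDALONE_TESTS"] = "1"
print(should_skip(True))   # False: executed by the standalone runner
```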
Dockers are hopefully fixed. Will check in the morning.
@rohitgr7, the dockers are building. The image build of build-pl is slightly wrong, because it bases itself on the build-cuda images available on Docker Hub, which are not there yet. That's not an issue, though: the result of build-cuda will be pushed to Docker Hub on merge to master, and once that happens everything will work out as it should.
However, there is one GPU test that is consistently failing. I think it was written by @awaelchli; do you think you can take a look there, please?
So, I found the issue with the failing test. In #14984 we started making sure that forked processes are not poisoned by CUDA calls. However, just importing colossalai caused a call to torch.cuda.num_devices() (through calling torch_amp). The fix is to guard all colossalai imports with our context manager _patch_cuda_is_available. I'm not sure this is the nicest possible solution, but hey, at least it works.
cc @carmocca to give a decision on whether it's okay enough solution
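The guard pattern discussed above can be sketched as follows. This is a hedged illustration, not Lightning's actual _patch_cuda_is_available: it patches a stand-in namespace instead of torch.cuda so the sketch runs without a GPU, but the idea is the same — while the context manager is active, any import-time probe of CUDA availability sees False and therefore does not initialize CUDA in the forked process.

```python
# Sketch of the guard pattern (stand-in names; not Lightning's real helper).
import types
from contextlib import contextmanager
from unittest import mock

# Stand-in for torch.cuda; the real guard targets torch.cuda.is_available.
fake_cuda = types.SimpleNamespace(is_available=lambda: True)

@contextmanager
def patch_cuda_is_available():
    """While active, an import that probes CUDA availability sees False,
    so it cannot poison a fork-started process with CUDA initialization."""
    with mock.patch.object(fake_cuda, "is_available", return_value=False):
        yield

print(fake_cuda.is_available())        # True outside the guard
with patch_cuda_is_available():
    # e.g. `import colossalai` would go here in the real code
    print(fake_cuda.is_available())    # False while guarded
print(fake_cuda.is_available())        # True again after the guard exits
```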