
Add ColossalAI strategy


What does this PR do?

Fixes https://github.com/hpcaitech/ColossalAI/issues/1330 Fixes https://github.com/Lightning-AI/lightning/issues/12733

Add ColossalAI strategy which supports ZeRO-DP with chunk-based memory management.
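For context, a minimal usage sketch of what the new strategy would look like from the Trainer side; the import path and constructor defaults shown here are assumptions based on this PR's description, not the merged API:

import pytorch_lightning as pl
from pytorch_lightning.strategies import ColossalAIStrategy  # module path assumed

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    precision=16,
    strategy=ColossalAIStrategy(),  # ZeRO-DP with chunk-based memory management
)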

Does your PR introduce any breaking changes? If yes, please list them.

No.

Before submitting

  • [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [ ] Did you make sure to update the documentation with your changes? (if necessary)
  • [ ] Did you write any new necessary tests? (not for typos and docs)
  • [ ] Did you verify new and existing tests pass locally with your changes?
  • [x] Did you list all the breaking changes introduced by this pull request?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • [ ] Is this pull request ready for review? (if not, please submit in draft mode)
  • [ ] Check that all items from Before submitting are resolved
  • [ ] Make sure the title is self-explanatory and the description concisely explains the PR
  • [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

ver217, Aug 16 '22 03:08

Hey @ver217 @1SAA, great work here so far. I think we are close. Can you fix the missing imports? Then we can check it on the CI. Currently there's an error: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=98164&view=logs&j=3f274fac-2e11-54ca-487e-194c91f3ae9f&t=fd22e4d0-068c-5651-f467-789f79e7bb7b

You can try running some tests locally to fix the import issues.

I don't have edit access here, so I can't push any commits.

Also, do you plan to add the docs in this PR or in a follow-up?

rohitgr7, Sep 13 '22 07:09

Hi @rohitgr7

Thank you for your review. I've fixed the import issue. As for the docs, we've written the docstring in ColossalAIStrategy, but we're not sure what else we should do. Could you tell us what else is needed for the docs?

1SAA, Sep 13 '22 09:09

Hi @rohitgr7

After our discussion, we've decided to add the docs in another PR. Could we just merge this PR first?

1SAA, Sep 15 '22 01:09

Okay, looks like the colossalai tests are not running and are skipped since the package is not installed on the CI. The install command is: pip install colossalai==0.1.10+torch1.12cu11.3. @1SAA, can you add this to https://github.com/Lightning-AI/lightning/blob/master/requirements/pytorch/strategies.txt?
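For reference, the corresponding addition to requirements/pytorch/strategies.txt would pin exactly the version from the install command above (whether the CI also needs an extra wheel index to resolve the +torch1.12cu11.3 build tag is not covered in this thread):

colossalai==0.1.10+torch1.12cu11.3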

Also, I checked locally and the tests are failing.

PL_RUN_STANDALONE_TESTS=1 pytest strategies/test_colossalai_strategy.py -v

one of the errors:

ImportError: Please install colossalai from source code to use HybridAdam

looks like HybridAdam is not released yet.
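For reference, a minimal sketch of the kind of optimizer setup that trips this error; the import path of HybridAdam is assumed from ColossalAI's layout at the time and may differ between releases:

import torch
import pytorch_lightning as pl
from colossalai.nn.optimizer import HybridAdam  # assumed path; raises the ImportError quoted above if the CUDA extension isn't available

class DemoModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # the strategy's tests pair the model with HybridAdam, per the error above
        return HybridAdam(self.parameters(), lr=1e-3)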

rohitgr7, Sep 15 '22 18:09

Hi @rohitgr7

The package link is added and HybridAdam can be imported after installation. But I still have a problem with my local tests. All of my tests pass when run "standalone", by which I mean running a single test function with the standalone parameter set to False and PL_RUN_STANDALONE_TESTS set to 0. The local tests don't seem to run properly when PL_RUN_STANDALONE_TESTS=1: I tried running the file test_ddp.py and it just ran forever. It seems the tests do not really run standalone, and some deadlock likely occurs. Is it normal to see deadlocks in local tests, and how does the real testing platform handle standalone tests?
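For context, a sketch of how a standalone-gated test can be expressed; the Lightning test suite uses its own RunIf helper for this, so the marker below is only a hypothetical stand-in illustrating the PL_RUN_STANDALONE_TESTS switch:

import os
import pytest

# hypothetical stand-in for the suite's RunIf(standalone=True) marker
standalone = pytest.mark.skipif(
    os.getenv("PL_RUN_STANDALONE_TESTS", "0") != "1",
    reason="set PL_RUN_STANDALONE_TESTS=1 to run standalone tests",
)

@standalone
def test_colossalai_multi_gpu():
    ...  # multi-process test; can deadlock if it shares a session with other DDP tests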

1SAA, Sep 19 '22 09:09

Dockers are hopefully fixed. Will check in the morning.

otaj, Oct 07 '22 22:10

@rohitgr7, dockers are building. The Docker image build of build-pl is slightly off because it bases itself on the build-cuda image from Docker Hub, which isn't available there yet. That's not a problem, though: the result of build-cuda will be pushed to Docker Hub on merge to master, and once that happens everything will work out as it should.

However, there is one GPU test that is consistently failing, and I think it was written by @awaelchli. Do you think you can take a look, please?

otaj, Oct 08 '22 09:10

So, I found the issue with the failing test. In #14984 we started making sure that the fork process is not poisoned by CUDA calls. However, just importing colossalai caused a call to torch.cuda.num_devices() (through calling torch_amp). The fix is to guard any colossalai imports with our context manager _patch_cuda_is_available. I'm not really sure this is the nicest possible solution, but hey, at least it works.
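For context, a self-contained sketch of the guarding idea (not the actual Lightning helper, whose implementation isn't shown in this thread); the toy context manager simply reports CUDA as unavailable while the third-party import runs, so import-time probes cannot initialize CUDA in the parent process:

from contextlib import contextmanager
from unittest import mock

@contextmanager
def _fork_safe_import():  # hypothetical stand-in for _patch_cuda_is_available
    # while the block runs, torch.cuda.is_available() returns False instead of touching the driver
    with mock.patch("torch.cuda.is_available", return_value=False):
        yield

with _fork_safe_import():
    import colossalai  # noqa: F401  # importing no longer poisons a later fork with CUDA state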

cc @carmocca to give a decision on whether this is an acceptable solution

otaj, Oct 10 '22 13:10