
Add finetuning strategies for DeepSpeed

Open ar90n opened this issue 3 years ago • 7 comments

What does this PR do?

This PR provides workarounds for using DeepSpeed during finetuning. DeepSpeed does not fully work with pytorch-lightning because its parameter loading and storing fail, so this PR adds finetuning strategies that omit parameter loading and storing.
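The approach can be illustrated with a small sketch. This is not the PR's actual code; every class and method name below is hypothetical. The idea is that a DeepSpeed-specific finetuning strategy overrides state persistence with no-ops, leaving checkpointing of the (sharded) parameters entirely to DeepSpeed:

```python
# Hypothetical sketch of a finetuning strategy whose parameter
# loading/storing is omitted, as described above. Names are illustrative.

class BaseFinetuning:
    """Minimal stand-in for a Lightning-style finetuning callback."""

    def state_dict(self):
        # The default behavior stores internal parameter-tracking state.
        return {"internal_optimizer_metadata": {}}

    def load_state_dict(self, state_dict):
        # The default behavior restores that state on resume.
        self._restored_state = state_dict


class DeepSpeedFreezeFinetuning(BaseFinetuning):
    """Variant for DeepSpeed: skip parameter loading and storing,
    since DeepSpeed shards parameters and the defaults fail."""

    def state_dict(self):
        # Omit storing parameter state; DeepSpeed manages checkpoints.
        return {}

    def load_state_dict(self, state_dict):
        # Omit restoring parameter state for the same reason.
        pass


if __name__ == "__main__":
    strategy = DeepSpeedFreezeFinetuning()
    print(strategy.state_dict())  # prints {}
```

The real strategies in this PR are registered alongside the existing finetuning strategies; the sketch only shows the "omit load/store" pattern.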

Fixes #1249

Before submitting

  • [x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [x] Did you make sure to update the documentation with your changes?
  • [x] Did you write any new necessary tests? [not needed for typos/docs]
  • [x] Did you verify new and existing tests pass locally with your changes?
  • [x] If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • [x] Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

ar90n avatar Jul 03 '22 05:07 ar90n

Codecov Report

Merging #1377 (acf3ae3) into master (0253d71) will decrease coverage by 0.01%. The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #1377      +/-   ##
==========================================
- Coverage   92.90%   92.88%   -0.02%     
==========================================
  Files         286      286              
  Lines       12874    12891      +17     
==========================================
+ Hits        11960    11974      +14     
- Misses        914      917       +3     
Flag       Coverage Δ
unittests  92.88% <77.77%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                            Coverage Δ
flash/core/finetuning.py                  88.23% <76.47%> (-2.47%) ⬇️
flash/core/utilities/imports.py           91.47% <100.00%> (+0.04%) ⬆️
flash/text/question_answering/model.py    93.87% <0.00%> (-0.69%) ⬇️
flash/core/serve/dag/task.py              97.88% <0.00%> (+1.05%) ⬆️


codecov[bot] avatar Jul 03 '22 05:07 codecov[bot]

I don't know how to fix the Lightning-AI.lightning-flash (Examples) jobs. Could you give me some help?

ar90n avatar Jul 03 '22 11:07 ar90n

Hi @krshrimali, thanks for your review and suggestions! They're very helpful, especially since my English is poor. I'm glad this PR was approved.

ar90n avatar Jul 22 '22 11:07 ar90n

I checked the Lightning-AI.lightning-flash (Examples) failure and found the following.

RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm not familiar with these errors. It seems the test process was terminated by a timeout. The test passes in my local environment. Could someone help me solve this issue?
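Following the hint in the error message, one way to get a trustworthy stack trace is to set `CUDA_LAUNCH_BLOCKING=1` before the CUDA context is created (i.e. before importing torch), so kernel launches run synchronously and the error surfaces at the call that caused it. The script layout below is illustrative, not the repo's actual test entry point:

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before the CUDA context is
# initialized, i.e. before importing torch, for it to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # would be imported here, after setting the flag

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # prints 1
```

With synchronous launches the traceback points at the failing kernel rather than at a later, unrelated API call.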

ar90n avatar Jul 22 '22 14:07 ar90n

Hi, @ar90n - Please don't worry about it. I'll have to check once on my personal GPU, sometimes these failures can be flaky (because of resources not being available or anything else). I'll merge it once I'm done testing, but this is good to go! 🎉

krshrimali avatar Jul 22 '22 14:07 krshrimali

The test passes locally on my GPU, so let's merge this and monitor the CI. If an alarm is raised, I'll attempt to fix it. Thanks, @ar90n, for your hard work and patience with this PR. 🎉

krshrimali avatar Jul 27 '22 04:07 krshrimali

Just added the CHANGELOG entry; let's wait for the CI and push it ASAP. <3

krshrimali avatar Aug 02 '22 04:08 krshrimali

@ar90n - FYI, it took us some time to fix the CI, sorry for that. @ethanwharris is currently OOO this week; whenever he is back, he'll help merge this. 🎉 Thank you for your contribution and patience.

krshrimali avatar Aug 26 '22 06:08 krshrimali