lightning-flash
Add finetuning strategies for DeepSpeed
What does this PR do?
This PR provides some workarounds for using DeepSpeed during finetuning. DeepSpeed does not fully work with pytorch-lightning because its parameter loading and storing fail, so this PR adds fine-tuning strategies in which parameter loading and storing are omitted.
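To illustrate the idea (this is only a sketch, not the actual flash API — the class and method names below are hypothetical): under DeepSpeed, parameters are sharded across workers, so a finetuning strategy must not try to load or store parameter state itself.

```python
class FinetuningStrategy:
    """Hypothetical base strategy: freezes a backbone and saves/restores
    its own state alongside the checkpoint."""

    def freeze_before_training(self, module):
        # `backbone_parameters` is an illustrative name, not a real API
        for p in module.backbone_parameters():
            p.requires_grad = False

    def state_dict(self):
        return {"internal": "strategy state"}

    def load_state_dict(self, state):
        self._state = state


class DeepSpeedFinetuningStrategy(FinetuningStrategy):
    """DeepSpeed variant: parameter loading/storing is deliberately
    omitted, since DeepSpeed manages (sharded) parameter state itself."""

    def state_dict(self):
        return {}  # nothing to store

    def load_state_dict(self, state):
        pass  # no-op: restoring here would conflict with DeepSpeed's sharding
```

The point is simply that the DeepSpeed variants override the state-handling hooks with no-ops, leaving everything else unchanged.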
Fixes #1249
Before submitting
- [x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
- [x] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests? [not needed for typos/docs]
- [x] Did you verify new and existing tests pass locally with your changes?
- [x] If you made a notable change (that affects users), did you update the CHANGELOG?
PR review
- [x] Is this pull request ready for review? (if not, please submit in draft mode)
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
Codecov Report
Merging #1377 (acf3ae3) into master (0253d71) will decrease coverage by 0.01%. The diff coverage is 77.77%.
```diff
@@            Coverage Diff             @@
##           master    #1377      +/-   ##
==========================================
- Coverage   92.90%   92.88%   -0.02%
==========================================
  Files         286      286
  Lines       12874    12891      +17
==========================================
+ Hits        11960    11974      +14
- Misses        914      917       +3
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 92.88% <77.77%> (-0.02%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| flash/core/finetuning.py | 88.23% <76.47%> (-2.47%) | :arrow_down: |
| flash/core/utilities/imports.py | 91.47% <100.00%> (+0.04%) | :arrow_up: |
| flash/text/question_answering/model.py | 93.87% <0.00%> (-0.69%) | :arrow_down: |
| flash/core/serve/dag/task.py | 97.88% <0.00%> (+1.05%) | :arrow_up: |
I don't know how to fix the Lightning-AI.lightning-flash (Examples) jobs. Could you give me some help?
Hi @krshrimali, thanks for your review and suggestions! They were very helpful, especially since my English is poor. I'm glad this PR was approved.
I looked into the Lightning-AI.lightning-flash (Examples) failure and found the following:

```
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

I'm not familiar with these errors. It seems the test process was terminated by a timeout. The test passes in my local environment. Could I get some help with this issue?
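As the message suggests, one way to get an accurate stack trace is to re-run the failing example with synchronous kernel launches so CUDA reports the error at the offending call. A minimal sketch (shown with a trivial Python process that just confirms the variable is visible; a real run would invoke the failing test instead):

```shell
# Synchronous launches: errors surface at the call that caused them.
CUDA_LAUNCH_BLOCKING=1 python3 -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'
```

Note this makes execution slower, so it is only suitable for debugging runs, not for CI defaults.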
Hi, @ar90n - Please don't worry about it. I'll have to check once on my personal GPU, sometimes these failures can be flaky (because of resources not being available or anything else). I'll merge it once I'm done testing, but this is good to go! 🎉
The test passes locally on my GPU, so let's merge this and monitor the CI. If an alarm is raised, I'll attempt to fix it. Thanks, @ar90n, for your hard work and patience with this PR. 🎉
Just added the CHANGELOG entry, let's wait for the CI, and push it ASAP. <3
@ar90n - FYI, it took us some time to fix the CI, sorry for that. @ethanwharris is currently OOO for this week, so whenever he is back, he'll help merge this. 🎉 Thank you for your contribution, and patience.