PaddleNLP
Allow pre-allocating memory for pretraining to improve memory use.
PR types
Others
PR changes
Others
Description
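Llama-2 70B model with training strategy tp4pp8-vpp5-mbs1-acc32 (sp enabled): without the release_grads option, training runs stably for 50 steps.
With release_grads enabled, training tends to hit OOM after a number of steps. The cause is that release_grads frees the memory occupied by gradients after every step and reallocates it in the next step. This increases the number of GPU memory operations and easily leads to memory fragmentation. Adding a memory pre-allocation feature (pre_alloc_memory), which allocates one large block of GPU memory for training up front, avoids this problem.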
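The pre-allocation idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `gib_to_bytes` and `pre_alloc_memory` (as a function), the GiB unit, and the `alloc_fn` hook are assumptions made for the example. The sketch also assumes the framework uses a caching allocator that keeps a released block in its pool, so that later gradient allocations are carved out of the one large block instead of fragmenting device memory.

```python
def gib_to_bytes(size_gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    return int(size_gib * 1024 ** 3)


def pre_alloc_memory(size_gib: float, alloc_fn=bytearray) -> int:
    """Allocate one contiguous buffer of `size_gib` GiB, then release it.

    `alloc_fn` is a hypothetical hook that takes a byte count and returns a
    buffer. The default (`bytearray`) only allocates host memory so the sketch
    runs anywhere. On GPU with Paddle one could pass, for example,
    `lambda n: paddle.empty([n], dtype="uint8")` (assumption: the allocator
    caches the freed block rather than returning it to the driver).

    Returns the number of bytes requested, or 0 if the feature is disabled.
    """
    num_bytes = gib_to_bytes(size_gib)
    if num_bytes <= 0:
        return 0  # a non-positive size means "do not pre-allocate"
    buf = alloc_fn(num_bytes)  # one large allocation, made before training
    del buf  # drop the reference; a caching allocator may keep the block
    return num_bytes


if __name__ == "__main__":
    # Reserve 1 MiB of host memory as a stand-in for a multi-GiB GPU block.
    print(pre_alloc_memory(1 / 1024))
```

In a training script, a call like this would run once before the first step, with the size chosen to leave headroom for allocations that fall outside the pool.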
Thanks for your contribution!
Codecov Report
All modified and coverable lines are covered by tests.
Project coverage is 55.81%. Comparing base (5619cc3) to head (548db29). Report is 925 commits behind head on develop.
Your project check has failed because the head coverage (55.81%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #8600   +/-   ##
========================================
  Coverage    55.81%   55.81%
========================================
  Files          620      620
  Lines        96599    96599
========================================
  Hits         53917    53917
  Misses       42682    42682
This Pull Request is stale because it has been open for 60 days with no activity.