PaddleNLP

Allow pre-allocating memory for pretraining for better memory use.

Open Xreki opened this issue 1 year ago • 3 comments

PR types

Others

PR changes

Others

Description

For the Llama-2 70B model with the training strategy tp4pp8-vpp5-mbs1-acc32 (sp enabled), training runs stably for 50 steps when the release_grads option is not enabled: [image]

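With release_grads enabled, training tends to hit OOM after a number of steps. The reason is that release_grads frees the memory occupied by gradients after every step and reallocates it in the next step. This increases the number of GPU memory operations, which makes memory fragmentation more likely. Adding a memory pre-allocation feature (pre_alloc_memory), that is, allocating one large block of GPU memory for training in advance, avoids the problem.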
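To illustrate why an up-front reservation can help, here is a minimal sketch in plain Python. It is a toy model and not Paddle's allocator: `ChunkedAllocator`, `simulate`, the sizes, and the rule that chunks are never returned to the device are all assumptions made for illustration. The toy allocator grows by requesting chunks from a fixed-size device and can only merge free space inside a single chunk, so repeated free/reallocate cycles with different sizes can leave plenty of free memory that no single request can use.

```python
class ChunkedAllocator:
    """Toy caching allocator (hypothetical, not Paddle's implementation).

    It requests chunks from a fixed-size device and never returns them.
    Free space is merged only within a chunk, never across chunks.
    """

    def __init__(self, device_capacity):
        self.device_free = device_capacity
        self.chunks = []  # each chunk: {"size": int, "used": {offset: size}}

    @staticmethod
    def _first_gap(chunk, size):
        # Walk live blocks in address order and return the first hole that fits.
        cursor = 0
        for off in sorted(chunk["used"]):
            if off - cursor >= size:
                return cursor
            cursor = off + chunk["used"][off]
        if chunk["size"] - cursor >= size:
            return cursor
        return None

    def alloc(self, size):
        # Try to reuse cached space inside an existing chunk first.
        for idx, chunk in enumerate(self.chunks):
            off = self._first_gap(chunk, size)
            if off is not None:
                chunk["used"][off] = size
                return (idx, off)
        # Otherwise grow: ask the device for a new chunk of exactly this size.
        if size > self.device_free:
            raise MemoryError(f"cannot allocate {size} units")
        self.device_free -= size
        self.chunks.append({"size": size, "used": {0: size}})
        return (len(self.chunks) - 1, 0)

    def free(self, handle):
        idx, off = handle
        del self.chunks[idx]["used"][off]


def simulate(device_capacity, pre_alloc):
    """Return True if the toy workload finishes without running out of memory."""
    allocator = ChunkedAllocator(device_capacity)
    if pre_alloc:
        # Reserve one large chunk up front, then hand it back to the cache.
        allocator.free(allocator.alloc(pre_alloc))
    try:
        # Step 1: an activation buffer and a gradient buffer, both freed at step end.
        act = allocator.alloc(25)
        grad = allocator.alloc(40)
        allocator.free(act)
        allocator.free(grad)
        # Step 2: a larger buffer is requested; everything else is already free.
        big = allocator.alloc(50)
        allocator.free(big)
    except MemoryError:
        return False
    return True


if __name__ == "__main__":
    print("without pre-allocation:", simulate(100, pre_alloc=0))
    print("with pre-allocation:   ", simulate(100, pre_alloc=100))
```

In this toy run, the version without pre-allocation fails at the 50-unit request even though nothing is live at that point: the free space is split across a 25-unit chunk, a 40-unit chunk, and 35 units still on the device. With one 100-unit chunk reserved first, the same requests are carved out of a single contiguous region and succeed.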

Xreki avatar Jun 13 '24 09:06 Xreki

Thanks for your contribution!

paddle-bot[bot] avatar Jun 13 '24 09:06 paddle-bot[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.81%. Comparing base (5619cc3) to head (548db29). Report is 925 commits behind head on develop.

:x: Your project check has failed because the head coverage (55.81%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #8600   +/-   ##
========================================
  Coverage    55.81%   55.81%           
========================================
  Files          620      620           
  Lines        96599    96599           
========================================
  Hits         53917    53917           
  Misses       42682    42682           


codecov[bot] avatar Jun 13 '24 10:06 codecov[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Aug 20 '24 00:08 github-actions[bot]

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 14 '24 07:10 CLAassistant

This Pull Request is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Dec 14 '24 00:12 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Feb 14 '25 00:02 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar May 22 '25 00:05 github-actions[bot]