PaddleNLP Allow pre-allocating memory for pretraining for better memory use.

PR types

Others

PR changes

Others

Description

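For the Llama-2 70B model with the training strategy tp4pp8-vpp5-mbs1-acc32 (sp enabled), training runs stably for 50 steps when the release_grads option is not enabled.

With release_grads enabled, training tends to hit OOM after a number of steps. The cause is that release_grads frees the memory occupied by the gradients after every step and reallocates it in the next step, which increases the number of GPU memory operations and therefore tends to cause GPU memory fragmentation. Adding a memory pre-allocation feature (pre_alloc_memory), which allocates one large block of GPU memory for training up front, avoids this problem.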
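The pre-allocation idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `pre_alloc_memory`, its `size_gb` parameter, and the injected `alloc_fn` callback are all hypothetical. It also assumes that the framework's allocator caches a freed block for reuse instead of returning it to the device, which is what would let one large early allocation serve later, smaller requests.

```python
def pre_alloc_memory(size_gb, alloc_fn):
    """Request one buffer of `size_gb` GiB through `alloc_fn`, then drop it.

    Hypothetical helper for illustration only. `alloc_fn` takes a byte
    count and returns a buffer object. Returns the number of bytes requested.
    """
    if size_gb <= 0:
        return 0  # treat a non-positive size as "feature disabled"
    num_bytes = int(size_gb * 1024 ** 3)
    buffer = alloc_fn(num_bytes)
    # Drop the only reference. The assumption is that a caching allocator
    # keeps the large block reserved for later allocations.
    del buffer
    return num_bytes


# With Paddle, the callback could be written along these lines
# (an assumption about how one might wire it up, not taken from the PR):
#
#   import paddle
#   pre_alloc_memory(60, lambda n: paddle.empty([n], dtype="uint8"))
```

Passing the allocation function in as a parameter keeps the sketch runnable without a GPU, for example with `bytearray` as a CPU stand-in.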

Jun 13 '24 09:06 Xreki

Thanks for your contribution!

Jun 13 '24 09:06 paddle-bot[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.81%. Comparing base (5619cc3) to head (548db29). Report is 925 commits behind head on develop.

:x: Your project check has failed because the head coverage (55.81%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #8600   +/-   ##
========================================
  Coverage    55.81%   55.81%           
========================================
  Files          620      620           
  Lines        96599    96599           
========================================
  Hits         53917    53917           
  Misses       42682    42682

Jun 13 '24 10:06 codecov[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

Aug 20 '24 00:08 github-actions[bot]

All committers have signed the CLA.

Oct 14 '24 07:10 CLAassistant

This Pull Request is stale because it has been open for 60 days with no activity.

Dec 14 '24 00:12 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

Feb 14 '25 00:02 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity.

May 22 '25 00:05 github-actions[bot]
