CUDA Memory Profile Analyzer
What does this PR do?
- Collect a CUDA memory snapshot based on the previous commit (#9096), and further analyze which parts of the model contribute to the total memory footprint.
- The memory profile generates two `pickle` files, one for weights and one for activations. The user can load either file at https://pytorch.org/memory_viz
- If an out-of-memory error (CUDA OOM) occurs, the tool captures the snapshot before the OOM occurs and generates the `pickle` file.
- With the knob `analysis_enabled: True`, the memory profile analyzer generates two csv files each for weight/activation/OOM. The output csv files include:
  - Weight
    - `alive_memory_weight.csv`
    - `group_by_alloc_frames_weight.csv`
  - Activation
    - `alive_memory_memory.csv`
    - `group_by_alloc_frames_memory.csv`
  - OOM
    - `alive_memory_oom.csv`
    - `group_by_alloc_frames_oom.csv`
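Besides the web viewer, a generated pickle file can be inspected locally. The following is a minimal sketch, not part of this PR: `summarize_snapshot` is a hypothetical helper, and it assumes the file follows the usual PyTorch CUDA memory snapshot layout, that is, a dict whose `device_traces` entry holds one list of event dicts per device, each with an `action` string.

```python
import pickle
from collections import Counter


def summarize_snapshot(path):
    """Load a snapshot pickle and count trace events by action type.

    Assumption: the file holds a dict with a "device_traces" list, one
    list of event dicts per device, each event carrying an "action" key.
    Only unpickle files you generated yourself; pickle can execute code.
    """
    with open(path, "rb") as f:
        snapshot = pickle.load(f)

    counts = Counter()
    for device_trace in snapshot.get("device_traces", []):
        for event in device_trace:
            counts[event.get("action", "unknown")] += 1
    return dict(counts)
```

Calling `summarize_snapshot` on a snapshot file returns a plain dict mapping each action name to its event count, which gives a quick check that the trace was not truncated to nothing before uploading it to the viewer.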
Changelog
- Fix some issues in the previous memory profile
  - `batch_idx` mismatch issue.
  - `max_entries` was too small, which made the generated snapshot easily truncated.
- Add weight memory capturing.
- Add OOM case support.
- Add an option to run further analysis on the generated memory snapshot file. The analyzer finds the peak memory of the snapshot and generates two csv files:
  - All the memory buffers alive at that peak moment.
  - The same buffers grouped by allocation frames, showing the relationship between each model layer and its memory footprint.
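The peak analysis described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the analyzer shipped in this PR: the event format (`action`, `addr`, `size`, `frame`) is a simplified, hypothetical stand-in for the real snapshot trace, and `analyze_peak` is a made-up name.

```python
from collections import defaultdict


def analyze_peak(events):
    """Return (peak_bytes, alive_at_peak, bytes_by_frame) for an event trace.

    Hypothetical event format: a dict with "action" ("alloc" or "free"),
    "addr", and for allocations "size" (bytes) and "frame" (a label for
    the allocating call site).
    """
    # Pass 1: find the event index at which live memory is highest.
    live = {}
    current = peak = 0
    peak_index = -1
    for i, ev in enumerate(events):
        if ev["action"] == "alloc":
            live[ev["addr"]] = ev
            current += ev["size"]
        elif ev["action"] == "free":
            freed = live.pop(ev["addr"], None)
            if freed is not None:
                current -= freed["size"]
        if current > peak:
            peak, peak_index = current, i

    # Pass 2: replay up to the peak to recover the buffers alive then.
    alive = {}
    for ev in events[: peak_index + 1]:
        if ev["action"] == "alloc":
            alive[ev["addr"]] = ev
        elif ev["action"] == "free":
            alive.pop(ev["addr"], None)

    # Group the alive buffers by allocation frame and sum their sizes.
    by_frame = defaultdict(int)
    for ev in alive.values():
        by_frame[ev["frame"]] += ev["size"]
    return peak, list(alive.values()), dict(by_frame)
```

The first and second return values correspond to the "alive memory" view, and the third to the "group by allocation frames" view. Two passes keep the sketch simple; a single pass that copies the live set at each new peak would also work but costs more memory on long traces.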
Usage
- Add the knobs below to the yaml run config.

    # Memory Profile
    memory_profile:
      enabled: true
      start_step: 1
      end_step: 3
      rank: 0
      output_path: <path/to/out_file>
      analysis_enabled: true
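To make the meaning of these knobs concrete, here is a small sketch of how such a block might be modeled and checked. The `MemoryProfileConfig` class and its `should_record` method are hypothetical and do not come from this PR; in particular, treating `start_step` and `end_step` as an inclusive window is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MemoryProfileConfig:
    """Hypothetical mirror of the `memory_profile` YAML block."""

    enabled: bool = False
    start_step: int = 0
    end_step: int = 0
    rank: int = 0
    output_path: str = ""
    analysis_enabled: bool = False

    def __post_init__(self):
        # Reject windows that could never match any step.
        if self.start_step > self.end_step:
            raise ValueError("start_step must not exceed end_step")

    def should_record(self, step, rank):
        # Assumption: the step window is inclusive on both ends and
        # only the configured rank records a snapshot.
        return (
            self.enabled
            and rank == self.rank
            and self.start_step <= step <= self.end_step
        )
```

With the values from the config above, `should_record(2, 0)` would be true under these assumptions, while any step outside 1 to 3 or any rank other than 0 would be skipped.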
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information
- Related to # (issue)
One more request @tonyjie, can you rebase to the latest main & use --signoff on your commits and push again? Otherwise CI won't play.
I did that, but the CI keeps failing. Is there an issue with it? I also updated the code based on the review. Please have a look at the changes and all the replies. Thanks!
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
This PR seems to have slipped through. Should we merge it? @ericharper @titu1994