What does this PR do ?

Collects a CUDA memory snapshot, building on the previous commit (#9096), and further analyzes which parts of the model contribute to the total memory footprint.
The memory profiler generates two pickle files, one for weights and one for activations. The user can load these files on the following page: https://pytorch.org/memory_viz
If a CUDA out-of-memory (OOM) error occurs, the tool captures a snapshot before the OOM and generates a pickle file for it.
With the knob analysis_enabled: True, the memory profile analyzer generates two csv files for each of weight, activation, and OOM. The output csv files include:
1. Weight
  - alive_memory_weight.csv
  - group_by_alloc_frames_weight.csv
2. Activation
  - alive_memory_memory.csv
  - group_by_alloc_frames_memory.csv
3. OOM
  - alive_memory_oom.csv
  - group_by_alloc_frames_oom.csv
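
Besides the web viewer, a generated pickle file can be inspected locally. The following is a minimal sketch, assuming the snapshot unpickles to a plain dict; the function name `summarize_snapshot` is hypothetical and not part of this PR.

```python
import pickle
from pathlib import Path


def summarize_snapshot(path):
    """Load a pickled memory snapshot and describe its top-level entries.

    Assumption: the snapshot is a dict. Only unpickle files you trust.
    """
    with Path(path).open("rb") as f:
        snapshot = pickle.load(f)

    summary = {}
    for key, value in snapshot.items():
        # Record the container type and, where available, its length.
        length = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, length)
    return summary
```

Calling `summarize_snapshot("<path/to/out_file>")` returns a mapping from each top-level key to its type name and length, which is a quick way to check that a snapshot was not truncated before uploading it.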

Changelog

Fixed some issues in the previous memory profile:
- batch_idx mismatch issue.
- max_entries was too small, so the generated snapshot was easily truncated.
Added weight memory capturing.
Added OOM case support.
Added an option to enable further analysis of the generated memory snapshot file. The analyzer finds the peak memory of the snapshot and generates two csv files, containing:
1. All the memory buffers alive at the peak moment.
2. The same buffers grouped by allocation frames, showing the relationship between each model layer and its memory footprint.
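
The peak analysis described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the analyzer added in this PR: the event layout (`action`, `addr`, `size`, `frame`) is a simplified, hypothetical stand-in for the real snapshot trace, and `analyze_peak` is a made-up name.

```python
from collections import defaultdict


def analyze_peak(events):
    """Return (peak_bytes, alive_at_peak, bytes_by_frame).

    Each event is a hypothetical dict such as
    {"action": "alloc", "addr": 1, "size": 100, "frame": "layer_a"}
    or {"action": "free", "addr": 1}.
    """
    # Pass 1: replay the trace and remember where allocated bytes peak.
    live = {}
    current = peak = 0
    peak_index = -1
    for i, ev in enumerate(events):
        if ev["action"] == "alloc":
            live[ev["addr"]] = ev
            current += ev["size"]
        elif ev["action"] == "free":
            freed = live.pop(ev["addr"], None)
            if freed is not None:
                current -= freed["size"]
        if current > peak:
            peak, peak_index = current, i

    # Pass 2: replay up to the peak to recover the buffers alive then.
    live = {}
    for ev in events[: peak_index + 1]:
        if ev["action"] == "alloc":
            live[ev["addr"]] = ev
        elif ev["action"] == "free":
            live.pop(ev["addr"], None)
    alive = sorted(live.values(), key=lambda ev: ev["size"], reverse=True)

    # Group the alive buffers by the frame that allocated them.
    by_frame = defaultdict(int)
    for ev in alive:
        by_frame[ev["frame"]] += ev["size"]
    return peak, alive, dict(by_frame)
```

The two returned collections correspond to the two csv files: the list of alive buffers and the per-frame totals. Replaying the trace twice keeps the code simple at the cost of a second pass over the events.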

Usage

Add the below knobs to the yaml run config.

# Memory Profile
memory_profile:
   enabled: true
   start_step: 1
   end_step: 3
   rank: 0
   output_path: <path/to/out_file>
   analysis_enabled: true
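
One way to read these knobs is as a gate on which training steps and which rank record memory history. The sketch below is an assumption about their semantics, not the code in this PR: it treats the step window as inclusive at both ends, and `MemoryProfileConfig` and `should_record` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MemoryProfileConfig:
    # Field names mirror the yaml knobs; defaults are illustrative only.
    enabled: bool = False
    start_step: int = 0
    end_step: int = 0
    rank: int = 0
    output_path: str = ""
    analysis_enabled: bool = False


def should_record(cfg, step, rank):
    """Return True if this rank should be recording at this step.

    Assumption: the window [start_step, end_step] is inclusive and
    only the configured rank profiles.
    """
    return cfg.enabled and rank == cfg.rank and cfg.start_step <= step <= cfg.end_step
```

With the values shown in the config, rank 0 would record during steps 1 through 3 and every other rank would skip profiling.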

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

[ ] Make sure you read and followed Contributor guidelines
[ ] Did you write any new necessary tests?
[ ] Did you add or update any necessary documentation?
[ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

[ ] New Feature
[ ] Bugfix
[ ] Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Jul 24 '24 05:07 tonyjie

One more request @tonyjie , can you rebase to the latest main & use --signoff to your commits and push again ? Otherwise CI won't play.

Aug 01 '24 06:08 akoumpa

One more request @tonyjie , can you rebase to the latest main & use --signoff to your commits and push again ? Otherwise CI won't play.

I did that, but the CI keeps failing. Is there an issue with it? I also updated the code based on the review. Please have a look at all the replies. Thanks!

Aug 06 '24 23:08 tonyjie

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

Sep 05 '24 01:09 github-actions[bot]

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

Sep 24 '24 01:09 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

Oct 01 '24 02:10 github-actions[bot]

This PR seems to have slipped through. Should we merge it? @ericharper @titu1994

Oct 22 '24 14:10 pzelasko

CUDA Memory Profile Analyzer
