
Out of RAM using 24.07 container

aimarz opened this issue 1 year ago • 4 comments

I have tried training Llama 2, Llama 3.1, and Mistral models using the new 24.07 container, but the process errors out after 200 or 300 steps. It always happens while a checkpoint is being saved. I have made sure that cpu_offloading: false and use_cpu_initialization: false are set.
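For reference, this is the relevant slice of the model config (sketch of just that part, everything else omitted):

model:
  # Both CPU-memory-related options are disabled; placement under `model` follows
  # the stock megatron_gpt_config.yaml layout
  cpu_offloading: false
  use_cpu_initialization: false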

Training works fine with the 24.05 version and the exact same configuration, so I don't understand why this is happening. I would appreciate some help with this.

These are the system specifications (I use 8 nodes for training):

  • 2x Intel Xeon Platinum 8460Y+ 40C 2.3GHz (80 cores per node)
  • 4x NVIDIA Hopper H100 64GB HBM2
  • 16x DIMM 32GB 4800MHz DDR5 (512GB main memory per node)

And this is the error I get in SLURM:

srun: error: as01r3b17: task 3: Out Of Memory
srun: Terminating StepId=4777403.0
slurmstepd: error: *** STEP 4777403.0 ON as01r3b17 CANCELLED AT 2024-08-16T19:36:54 ***
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 23 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
slurmstepd: error: Detected 1 oom_kill event in StepId=4777403.0. Some of the step tasks have been OOM Killed.

aimarz avatar Aug 18 '24 09:08 aimarz

@aimarz can you post a reproducer and the whole log?

mikolajblaz avatar Aug 21 '24 16:08 mikolajblaz

Yes, this seems similar to what I've observed with 15B / 24.07. Locally, checkpointing a 15B/TP4 model (checkpoint size is 205 GB) reserves 70 GB of process memory and 290 GB of CPU/buffered disk I/O memory (360 GB total), and when the run continues I sometimes see an OOM/SIGKILL.

Most of that memory gets released after running the following (outside Docker):

free && sync && echo 3 > /proc/sys/vm/drop_caches && free
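To confirm that the bulk of it is page cache rather than process memory, something like this is enough (just a sanity-check sketch):

# Before/after checkpointing: Cached/Buffers should account for most of the "used" memory
grep -E 'MemFree|MemAvailable|Buffers|^Cached|Dirty|Writeback' /proc/meminfo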

dchichkov avatar Aug 21 '24 17:08 dchichkov

In my case it happens even when TP_size=1 and PP_size=1, that is, using only DP.

@mikolajblaz here I attach the full logs as well as the configuration file used in the experiment (Llama 2 7B continual pretraining with NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py).

issue_config.yaml.txt
issue_log_ERR.txt
issue_log_OUT.txt

aimarz avatar Aug 21 '24 17:08 aimarz

So far, no more OOMs/SIGKILLs with the free-memory target set to 64 GB (though I was only seeing the crash sporadically to begin with).

sysctl -w vm.min_free_kbytes=$((64 * 1024 * 1024))
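Something like this should apply it across the whole allocation before the training step (untested sketch; assumes root or non-interactive sudo on the compute nodes):

# Set the free-memory target (64 GB expressed in kB) on every node of the current job allocation
srun --nodes=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 \
    sudo sysctl -w vm.min_free_kbytes=$((64 * 1024 * 1024))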

dchichkov avatar Aug 21 '24 23:08 dchichkov

Hello, I'm facing a similar issue. @dchichkov would you mind sharing the output of:

scontrol show node <node-name> | grep CfgTRES

In our case it's:

CfgTRES=cpu=192,mem=1000000M,billing=192,gres/gpu=8

However, when I check the total amount of memory on that node:

cat /proc/meminfo | grep MemTotal

it shows:

MemTotal:       1056270564 kB

I suspect this is simply a mismatch between what's configured in Slurm (mem=1000000M) and the total amount of memory on the node (~1056 GB), and that by setting the free-memory target to 64 GB you force the OS to keep allocated memory below what's configured in Slurm.
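A quick way to compare the two numbers on an affected node (sketch; <node-name> is a placeholder):

# Slurm's configured memory for the node (in MB) vs. the kernel's view (in kB)
scontrol show node <node-name> | grep -o 'RealMemory=[0-9]*'
grep MemTotal /proc/meminfo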

I'm also using the 24.07 NeMo image.

Setting the free-memory target to 64 GB also solved the problem for me.
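If it helps anyone, making it persistent across reboots would look something like this (sketch, needs root on the compute nodes; 67108864 kB = 64 GB, and the file name is arbitrary):

# Persist vm.min_free_kbytes via sysctl.d so it survives reboots
echo 'vm.min_free_kbytes = 67108864' | sudo tee /etc/sysctl.d/90-checkpoint-oom.conf
sudo sysctl --system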

evellasques avatar Sep 06 '24 14:09 evellasques

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 07 '24 02:10 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Oct 15 '24 01:10 github-actions[bot]