
Memory requirements for checkpointing workloads are not clear and possibly buggy.

Open sajikvr opened this issue 6 months ago • 0 comments

I'm trying to run llama3-405b checkpointing on 32 hosts, each with 220 GB of memory. According to the datasize command, the memory needed per host (8 ranks per host) is around 90 GB:

(myenv) nutanix@clientvm001:~$ mlpstorage checkpointing datasize --hosts 127.0.0.1,127.0.0.2 --client-host-memory-in-gb 256 --model llama3-405b --num-processes 512 --checkpoint-folder /mnt/test_data --results-dir mlpstorage_test_results
Hosts is: ['127.0.0.1,127.0.0.2']
Hosts is: ['127.0.0.1', '127.0.0.2']
2025-07-03 00:30:46|STATUS: Benchmark results directory: mlpstorage_test_results/checkpointing/llama3-405b/20250703_003046
2025-07-03 00:30:46|INFO: Found benchmark run: checkpointing_datasize_llama3-405b_20250703_003046
2025-07-03 00:30:46|STATUS: Verifying benchmark run for checkpointing_datasize_llama3-405b_20250703_003046
2025-07-03 00:30:46|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='checkpointing', command='datasize', model='llama3-405b', run_datetime='20250703_003046')])
2025-07-03 00:30:46|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2025-07-03 00:30:46|STATUS: Instantiated the Checkpointing Benchmark...
2025-07-03 00:30:46|RESULT: Total GB required per rank:
                Rank 0: 11.80 GB
                Rank 1: 11.80 GB
                Rank 2: 11.80 GB
                Rank 3: 11.80 GB
                Rank 4: 11.80 GB
                Rank 5: 11.80 GB
                Rank 6: 11.80 GB
                Rank 7: 11.80 GB
                Rank 8: 11.80 GB
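
As a quick sanity check of those numbers (per-rank figure taken from the datasize output above, 220 GB per host as in the run below), the workload should fit comfortably in host memory:

```python
# Back-of-the-envelope check using the values reported above.
gb_per_rank = 11.8      # "Total GB required per rank" from the datasize output
ranks_per_host = 8      # 8 ranks per host, as described above
host_memory_gb = 220    # memory available on each client host

required_per_host = gb_per_rank * ranks_per_host   # ~94.4 GB
print(f"required per host: {required_per_host:.1f} GB of {host_memory_gb} GB available")
```

So roughly 94 GB of the 220 GB per host should be needed, well under what is available.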

But the test fails with an error that appears to imply insufficient memory (even running 2 processes per host fails; 1 process per host succeeds):

export PMIX_MCA_gds=hash;mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102,10.57.205.103,10.57.205.104,10.57.205.105,10.57.205.106,10.57.205.107,10.57.205.108,10.57.205.109,10.57.205.110,10.57.205.111,10.57.205.112,10.57.205.113,10.57.205.114,10.57.205.115,10.57.205.116,10.57.205.117,10.57.205.118,10.57.205.119,10.57.205.120,10.57.205.121,10.57.205.122,10.57.205.123,10.57.205.124,10.57.205.125,10.57.205.126,10.57.205.127,10.57.205.128,10.57.205.129,10.57.205.130,10.57.205.131,10.57.205.132  --model llama3-405b  --client-host-memory-in-gb 220  --num-processes 64 --checkpoint-folder /mnt/host_checkpointing  --results-dir checkpt_405b_32_hosts --num-checkpoints-read 0 --num-checkpoints-write 10 --allow-run-as-root --closed

….

[OUTPUT] 2025-07-02T21:28:26.324781 Running DLIO [Checkpointing] with 64 process(es)

Error executing job with overrides: ['workload=llama3_405b', '++workload.checkpoint.mode=subset', '++workload.model.parallelism.data=2', '++workload.checkpoint.num_checkpoints_read=0', '++workload.checkpoint.num_checkpoints_write=10', '++workload.checkpoint.checkpoint_folder=/mnt/host_checkpointing/llama3-405b']
Traceback (most recent call last):
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 459, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 153, in __init__
    self.stats = StatsCounter()
                 ^^^^^^^^^^^^^^
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py", line 105, in __init__
    self.MPI.comm().Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
  File "src/mpi4py/MPI.src/Comm.pyx", line 1100, in mpi4py.MPI.Comm.Reduce
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
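
For what it's worth, MPI_ERR_TRUNCATE on a Reduce usually means the ranks disagree about the size of the buffers being reduced (a receiver sees a larger message than it expected), rather than the host running out of memory. A minimal mpi4py sketch of that failure mode (hypothetical, not taken from dlio_benchmark) just to illustrate the class of error:

```python
# Minimal illustration (hypothetical, not dlio_benchmark code) of how a
# Reduce can end in MPI_ERR_TRUNCATE when ranks pass buffers of different
# sizes. Run with something like: mpirun -np 2 python reduce_truncate.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# All ranks must call Reduce with the same element count; here rank 0 uses
# a larger buffer than the others, which is erroneous per the MPI standard
# and typically surfaces as "MPI_ERR_TRUNCATE: message truncated".
count = 2 if rank == 0 else 1
host_memory = np.full(count, 220.0, dtype=np.float64)   # per-rank value(s)
host_memory_agg = np.zeros(count, dtype=np.float64)     # aggregation buffer

comm.Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
if rank == 0:
    print("aggregated host memory (GB):", host_memory_agg)
```

I don't know whether that is actually what statscounter.py is hitting here, but the symptom looks more like mismatched Reduce buffers across the 64 ranks than a genuine lack of host memory.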
