Checkpointing with reads only shows warnings
Running only the read part of the checkpointing test (writes set to 0):
mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102 --model llama3-70b --client-host-memory-in-gb 220 --num-processes 8 --checkpoint-folder /mnt/host_checkpointing --results-dir checkpoint_test_4_hosts_llama3-70b --num-checkpoints-read 1 --num-checkpoints-write 0 --allow-run-as-root
The test appears to succeed, but it prints several NumPy RuntimeWarnings right after "Checkpointing write started". Is this accepted as a successful run, @FileSystemGuy?
[OUTPUT] 2025-07-02T16:54:20.541033 Running DLIO [Checkpointing] with 8 process(es)
[OUTPUT] 2025-07-02T16:54:20.543562 Performing subset checkpointing: 8 of 64
[OUTPUT] 2025-07-02T16:54:20.544388 Total number of parameters in the model: 69882617856
[OUTPUT] 2025-07-02T16:54:40.785006 Model size: 16.272964 GB (subset)
[OUTPUT] 2025-07-02T16:54:40.785131 Optimizer state size: 97.628551 GB (subset)
[OUTPUT] 2025-07-02T16:54:40.785164 Total checkpoint size: 113.901516 GB (subset)
[OUTPUT] 2025-07-02T16:54:40.785789 Checkpointing read started
[OUTPUT] 2025-07-02T16:54:45.786175 Starting loading checkpoint 1 after total step 1 for epoch 1
[OUTPUT] 2025-07-02T16:54:46.984831 Loaded model checkpoint in 1.1977 seconds
[OUTPUT] 2025-07-02T16:54:53.514932 Loaded optimizer checkpoint in 6.5299 seconds
[OUTPUT] 2025-07-02T16:54:53.515092 Finished loading checkpoint 1 for epoch 1 in 7.7289 s; Throughput: 14.7371 GB/s
[OUTPUT] 2025-07-02T16:54:53.517204 Checkpointing write started
/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3904: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/numpy/_core/_methods.py:227: RuntimeWarning: Degrees of freedom <= 0 for slice
ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/numpy/_core/_methods.py:184: RuntimeWarning: invalid value encountered in divide
arrmean = um.true_divide(arrmean, div, out=arrmean,
/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/numpy/_core/_methods.py:219: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
[OUTPUT] 2025-07-02T16:54:53.562293 Saved outputs in /home/nutanix/SUBMISSION/checkpoint_test_4_hosts_llama3-70b/checkpointing/llama3-70b/20250702_165416
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 8
[METRIC] Checkpoint load duration (seconds): 7.7289 (0.0000)
[METRIC] Checkpoint load I/O Throughput (GB/second): 14.7371 (0.0000)
[METRIC] ==========================================================
[OUTPUT] 2025-07-02T16:54:53.562873 outputs saved in RANKID_output.json
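For reference, the warnings look like what NumPy emits when a mean and standard deviation are taken over an empty array. The sketch below reproduces that in isolation. It assumes (not confirmed against the DLIO source) that the write-side duration list is empty because `--num-checkpoints-write 0` was passed; `summarize` and `safe_summarize` are hypothetical helper names used only for illustration.

```python
import warnings

import numpy as np


def summarize(samples):
    """Return (mean, std, warning messages) for a list of durations."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, no de-duplication
        arr = np.array(samples, dtype=float)
        mean = np.mean(arr)
        std = np.std(arr)
    return mean, std, [str(w.message) for w in caught]


def safe_summarize(samples):
    """Skip the statistics entirely when there is nothing to average."""
    if len(samples) == 0:
        return None
    arr = np.array(samples, dtype=float)
    return float(np.mean(arr)), float(np.std(arr))


# Assumption: zero write checkpoints leaves the write-duration list empty.
mean, std, messages = summarize([])
print(mean, std)  # both are NaN
for message in messages:
    print(message)

# One read checkpoint gives a well-defined mean and a standard deviation of 0.
print(safe_summarize([7.7289]))
print(safe_summarize([]))
```

If that assumption holds, the warnings would be cosmetic and the read metrics would be unaffected, since a single read sample yields a defined mean with zero deviation, consistent with the `(0.0000)` shown above. Whether the run counts as valid for submission is still a question for the maintainers.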