Checkpointing with llama3-405b hangs

Open sajikvr opened this issue 6 months ago • 0 comments

With 16 or 32 hosts, each with 256 GB of memory, the test starts and every host has dlio_benchmark processes running, but the test itself makes no progress. The 8b and 70b models run fine on the same storage. I did hit issue #176 while running and applied the same workaround I used for resnet50; I am not sure whether that workaround is sufficient in this case.

Command used:

export PMIX_MCA_gds=hash; mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102,10.57.205.103,10.57.205.104,10.57.205.105,10.57.205.106,10.57.205.107,10.57.205.108,10.57.205.109,10.57.205.110,10.57.205.111,10.57.205.112,10.57.205.113,10.57.205.114,10.57.205.115,10.57.205.116,10.57.205.117,10.57.205.118,10.57.205.119,10.57.205.120,10.57.205.121,10.57.205.122,10.57.205.123,10.57.205.124,10.57.205.125,10.57.205.126,10.57.205.127,10.57.205.128,10.57.205.129,10.57.205.130,10.57.205.131,10.57.205.132 --model llama3-405b --client-host-memory-in-gb 256 --num-processes 512 --checkpoint-folder /mnt/host_checkpointing --results-dir checkpoint_test_32_hosts_llama3-450b --num-checkpoints-read 10 --num-checkpoints-write 10 --allow-run-as-root --closed
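
For anyone trying to reproduce this, the 32-address host list does not have to be typed out by hand. A minimal sketch, assuming `seq` and `paste` are available on the launch host; `HOSTS` is just an illustrative variable name, not something mlpstorage requires:

```shell
# Build the comma-separated host list 10.57.205.101 .. 10.57.205.132.
# HOSTS is a hypothetical variable name chosen for this example.
HOSTS=$(seq -f "10.57.205.%g" 101 132 | paste -sd, -)

# Print the list so it can be checked before launching the run.
echo "$HOSTS"
```

The resulting value can then be passed as `--hosts "$HOSTS"` in place of the literal list in the command above.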

top output on each host:

  20186 nutanix   20   0 6573272 965976 541904 R 100.3   0.4   7:09.29 dlio_benchmark
  20192 nutanix   20   0 6573396 968360 544208 R 100.3   0.4   7:09.47 dlio_benchmark
  20176 nutanix   20   0 6573256 969324 544720 R 100.0   0.4   7:09.37 dlio_benchmark
  20179 nutanix   20   0 6573428 965516 541392 R 100.0   0.4   7:09.27 dlio_benchmark
  20180 nutanix   20   0 6573232 968504 544208 R 100.0   0.4   7:09.28 dlio_benchmark
  20182 nutanix   20   0 6573400 967600 542928 R 100.0   0.4   7:09.18 dlio_benchmark
  20184 nutanix   20   0 6573276 968124 543864 R 100.0   0.4   7:09.28 dlio_benchmark
  20185 nutanix   20   0 6573272 969036 544976 R 100.0   0.4   7:09.23 dlio_benchmark
  20189 nutanix   20   0 6573272 968060 543956 R 100.0   0.4   7:09.22 dlio_benchmark
  20191 nutanix   20   0 6573264 969756 545488 R 100.0   0.4   7:09.20 dlio_benchmark
  20193 nutanix   20   0 6573276 966452 541904 R 100.0   0.4   7:09.32 dlio_benchmark
  20194 nutanix   20   0 6573388 967844 543440 R 100.0   0.4   7:09.28 dlio_benchmark
  20195 nutanix   20   0 6573272 974864 550324 R 100.0   0.4   7:09.19 dlio_benchmark
  20196 nutanix   20   0 6573396 968604 544144 R 100.0   0.4   7:09.14 dlio_benchmark
  20197 nutanix   20   0 6573396 975180 550864 R 100.0   0.4   7:09.20 dlio_benchmark
  20198 nutanix   20   0 6573260 970444 546260 R 100.0   0.4   7:09.18 dlio_benchmark
  20199 nutanix   20   0 6573300 972124 547580 R 100.0   0.4   7:09.49 dlio_benchmark
  20200 nutanix   20   0 6573364 966712 542416 R 100.0   0.4   7:09.23 dlio_benchmark
  20201 nutanix   20   0 6573360 975428 551120 R 100.0   0.4   7:09.58 dlio_benchmark
  20202 nutanix   20   0 6573388 968000 543952 R 100.0   0.4   7:09.26 dlio_benchmark

sajikvr, Jul 02 '25 06:07