Checkpointing with llama3-405b hangs
Using 16 or 32 hosts, each with 256 GB of memory, the test starts and each host has dlio_benchmark processes running, but the test itself makes no progress. The 8b and 70b models run fine with the same storage. I did see issue #176 while running and applied the same workaround that I used for resnet50; I am not sure whether that workaround is sufficient in this case.
Command used:
export PMIX_MCA_gds=hash;mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102,10.57.205.103,10.57.205.104,10.57.205.105,10.57.205.106,10.57.205.107,10.57.205.108,10.57.205.109,10.57.205.110,10.57.205.111,10.57.205.112,10.57.205.113,10.57.205.114,10.57.205.115,10.57.205.116,10.57.205.117,10.57.205.118,10.57.205.119,10.57.205.120,10.57.205.121,10.57.205.122,10.57.205.123,10.57.205.124,10.57.205.125,10.57.205.126,10.57.205.127,10.57.205.128,10.57.205.129,10.57.205.130,10.57.205.131,10.57.205.132 --model llama3-405b --client-host-memory-in-gb 256 --num-processes 512 --checkpoint-folder /mnt/host_checkpointing --results-dir checkpoint_test_32_hosts_llama3-450b --num-checkpoints-read 10 --num-checkpoints-write 10 --allow-run-as-root --closed
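For anyone trying to reproduce this, the long --hosts value above is just the consecutive range 10.57.205.101 through 10.57.205.132. A minimal Python sketch to regenerate it is below. The helper name build_hosts is hypothetical and not part of mlpstorage, and the processes-per-host figure assumes the 512 processes are spread evenly across hosts, which I have not confirmed.

```python
# Hypothetical helper (not part of mlpstorage): rebuilds the comma-separated
# --hosts value for a consecutive range of addresses in one /24 subnet.
def build_hosts(prefix: str, first: int, last: int) -> str:
    """Return 'prefix.first,...,prefix.last' with both ends inclusive."""
    return ",".join(f"{prefix}.{i}" for i in range(first, last + 1))


hosts = build_hosts("10.57.205", 101, 132)
num_hosts = len(hosts.split(","))

# Assumption: the launcher distributes --num-processes evenly across hosts.
num_processes = 512
procs_per_host = num_processes // num_hosts

print(hosts)
print(f"hosts={num_hosts} processes_per_host={procs_per_host}")
```

The printed string can be pasted directly after --hosts; changing the last argument to 116 gives the 16-host variant of the same run.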
top output on each host:
20186 nutanix 20 0 6573272 965976 541904 R 100.3 0.4 7:09.29 dlio_benchmark
20192 nutanix 20 0 6573396 968360 544208 R 100.3 0.4 7:09.47 dlio_benchmark
20176 nutanix 20 0 6573256 969324 544720 R 100.0 0.4 7:09.37 dlio_benchmark
20179 nutanix 20 0 6573428 965516 541392 R 100.0 0.4 7:09.27 dlio_benchmark
20180 nutanix 20 0 6573232 968504 544208 R 100.0 0.4 7:09.28 dlio_benchmark
20182 nutanix 20 0 6573400 967600 542928 R 100.0 0.4 7:09.18 dlio_benchmark
20184 nutanix 20 0 6573276 968124 543864 R 100.0 0.4 7:09.28 dlio_benchmark
20185 nutanix 20 0 6573272 969036 544976 R 100.0 0.4 7:09.23 dlio_benchmark
20189 nutanix 20 0 6573272 968060 543956 R 100.0 0.4 7:09.22 dlio_benchmark
20191 nutanix 20 0 6573264 969756 545488 R 100.0 0.4 7:09.20 dlio_benchmark
20193 nutanix 20 0 6573276 966452 541904 R 100.0 0.4 7:09.32 dlio_benchmark
20194 nutanix 20 0 6573388 967844 543440 R 100.0 0.4 7:09.28 dlio_benchmark
20195 nutanix 20 0 6573272 974864 550324 R 100.0 0.4 7:09.19 dlio_benchmark
20196 nutanix 20 0 6573396 968604 544144 R 100.0 0.4 7:09.14 dlio_benchmark
20197 nutanix 20 0 6573396 975180 550864 R 100.0 0.4 7:09.20 dlio_benchmark
20198 nutanix 20 0 6573260 970444 546260 R 100.0 0.4 7:09.18 dlio_benchmark
20199 nutanix 20 0 6573300 972124 547580 R 100.0 0.4 7:09.49 dlio_benchmark
20200 nutanix 20 0 6573364 966712 542416 R 100.0 0.4 7:09.23 dlio_benchmark
20201 nutanix 20 0 6573360 975428 551120 R 100.0 0.4 7:09.58 dlio_benchmark
20202 nutanix 20 0 6573388 968000 543952 R 100.0 0.4 7:09.26 dlio_benchmark