llm-foundry
Composer crashes when attempting to load sharded checkpoint
When attempting to load a sharded checkpoint, we (@prigoyal and I) hit the following error:
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 287, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 530, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'
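The shape of the failure is easy to see in isolation: `_get_state_dict_2d_layout` assumes every value in the model state dict is tensor-like (has a `.size()` method), so any non-tensor entry breaks it. A minimal sketch of the failure mode, with hypothetical keys and values that are not from our actual checkpoint:

```python
# Hypothetical minimal reproduction of the AttributeError above. The
# torch.distributed.checkpoint optimizer loader assumes every state-dict
# value is tensor-like, but an entry can be an io.BytesIO buffer instead
# (e.g. a module's serialized "_extra_state").
import io
import torch

model_state_dict = {
    "transformer.blocks.0.weight": torch.zeros(8, 8),          # tensor: .size() works
    "transformer.blocks.0._extra_state": io.BytesIO(b"meta"),  # buffer: no .size()
}

specs = {}
for key, value in model_state_dict.items():
    # Mirrors the first line inside _get_state_dict_2d_layout's loop;
    # raises AttributeError on the BytesIO entry.
    specs[key] = (None, value.size())
```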
Environment
Collecting system information...
---------------------------------
System Environment Report
Created: 2024-02-27 02:31:05 UTC
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.1.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1047-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.104.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7R13 Processor
Stepping: 1
CPU MHz: 2650.000
BogoMIPS: 5300.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.2
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.1.0+cu121
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==1.0.3
[pip3] torchvision==0.16.0+cu121
[pip3] triton==2.1.0
[pip3] triton-pre-mlir==2.0.0
[conda] Could not collect

Composer information
--------------------
Composer version: 0.17.2
Composer commit hash: None
Host processor model name: AMD EPYC 7R13 Processor
Host processor core count: 96
Number of nodes: 1
Accelerator model name: NVIDIA H100 80GB HBM3
Accelerators per node: 1
CUDA Device Count: 8
To reproduce
Steps to reproduce the behavior:
- Save a model checkpoint by setting fsdp_config.state_dict_type: sharded in the config.
- Attempt to load it by setting load_path to the directory containing the checkpoint files (a sketch of the equivalent Trainer calls follows this list).
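For reference, a minimal sketch of those two steps using Composer's `Trainer` directly; `model`, the dataloader, and the paths are placeholders, not taken from the original run. In llm-foundry the same settings live in the YAML config.

```python
# Sketch only: `model` is a placeholder ComposerModel; paths are illustrative.
from composer import Trainer

# Step 1: train briefly and save a sharded (per-rank) checkpoint.
trainer = Trainer(
    model=model,
    fsdp_config={'state_dict_type': 'sharded'},
    save_folder='checkpoints',
    save_interval='10ba',
    max_duration='10ba',
)
trainer.fit()

# Step 2: point load_path at the directory holding the shard files.
trainer = Trainer(
    model=model,
    fsdp_config={'state_dict_type': 'sharded'},
    load_path='checkpoints/ep0-ba10',  # this is where the AttributeError surfaces
)
```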
Expected behavior
The checkpoint should be loaded and the model should continue training and/or evaluating.
Additional context
Hello @growlix, are you running this in fp8?
If so, this issue was fixed in https://github.com/mosaicml/composer/pull/2907 and released in v0.19.0, so you should upgrade your composer version.
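After upgrading, a quick sanity check that the new version is actually the one being imported:

```python
# Confirm the installed Composer version at runtime.
import composer
print(composer.__version__)  # expect '0.19.0' or later
```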
Thank you so much, @hanlint! We are running in fp8. We'll update to v0.19.0 and give it a whirl!
@hanlint, we tried composer 0.19.0 but we are still hitting the issue. Is there any change to the config we need to make? We are specifying the load path as the shard prefix, following this.
Traceback (most recent call last):
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 632, in <module>
    main(cfg)
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 564, in main
    trainer = Trainer(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1493, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 366, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 558, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'
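One possible stopgap while this is debugged, assuming losing optimizer and RNG state on resume is acceptable: the crash happens inside the `if not load_weights_only:` branch of `load_sharded_checkpoint` (visible in the composer frame above), so resuming with weights only should sidestep the failing optimizer-state path. A sketch, with `model` and the checkpoint path as placeholders:

```python
# Possible stopgap, not a confirmed fix: load_weights_only=True skips the
# load_sharded_optimizer_state_dict call that raises, at the cost of
# restarting optimizer and RNG state from scratch.
from composer import Trainer

trainer = Trainer(
    model=model,  # placeholder
    fsdp_config={'state_dict_type': 'sharded'},
    load_path='checkpoints/ep0-ba15',  # placeholder path
    load_weights_only=True,
)
```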