llm-foundry unable to save ckpt for mpt-30b

Open bpucla opened this issue 2 years ago • 0 comments

Environment

System Environment Report
Created: 2023-06-24 01:07:08 UTC

PyTorch information

PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.162+-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 510.108.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.216
BogoMIPS: 4400.43
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB
L1i cache: 1.5 MiB
L2 cache: 48 MiB
L3 cache: 77 MiB
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.0.2
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2023.1.0 h6d00ec8_46342
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.6 py310h1128e8f_1
[conda] mkl_random 1.2.2 py310h1128e8f_1
[conda] numpy 1.24.3 pypi_0 pypi
[conda] pytorch 2.0.1 py3.10_cuda11.8_cudnn8.7.0_0 pytorch
[conda] pytorch-cuda 11.8 h7e8668a_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 2.0.2 py310_cu118 pytorch
[conda] torchdata 0.6.1 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.15.2 pypi_0 pypi
[conda] torchtriton 2.0.0 py310 pytorch
[conda] torchvision 0.15.2 py310_cu118 pytorch

Composer information

Composer version: 0.15.0
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Host processor core count: 48
Number of nodes: 1
Accelerator model name: NVIDIA A100-SXM4-80GB
Accelerators per node: 1
CUDA Device Count: 8

To reproduce

Steps to reproduce the behavior:

composer train/train.py train/yamls/pretrain/mpt-30b.yaml data_local=my-copy-c4-8k train_loader.dataset.split=train_small eval_loader.dataset.split=val_small max_duration=10ba eval_interval=0 save_folder=mpt-30b
Training completed successfully.
Saving the checkpoint then failed with the following log output:
rank0[14870][MainThread]: DEBUG: composer.utils.checkpoint: Saving checkpoint to mpt-30b/ep{epoch}-ba{batch}-rank{rank}.pt
ERROR:composer.cli.launcher:Rank 0 crashed with exit code -9.
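
For readers interpreting the exit code in the log: Python's subprocess module reports a negative return code -N when a child process was terminated by signal N. Assuming the launcher surfaces that value unchanged (not verified against the Composer source), -9 would correspond to SIGKILL, which on Linux is often, though not always, sent by the kernel's out-of-memory killer. The helper name `describe_exit_code` below is hypothetical and only illustrates the decoding on a POSIX system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable message.

    Hypothetical helper: it relies only on the documented subprocess
    convention that a negative code -N means "terminated by signal N".
    """
    if code < 0:
        # signal.Signals maps a signal number to its symbolic name.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-9))
```

If the decoded signal is SIGKILL, the kernel log on the host is a reasonable next place to look for an out-of-memory kill record.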

Expected behavior

Checkpoints are saved to the save_folder without any rank crashing.

Additional context

The same code saves checkpoints for mpt-125m but not for mpt-30b. System memory is 1.3 TB.
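
To put the two model sizes next to the reported host memory, here is a rough back-of-the-envelope sketch. Everything numeric in it is an assumption rather than something taken from the run: nominal parameter counts of 125 million and 30 billion, fp32 weights, two fp32 optimizer states per parameter (Adam-style), and a worst case in which each of the 8 ranks materializes a full copy of the state in host memory. Whether the checkpoint path actually behaves that way is not confirmed by this report; the function names are hypothetical.

```python
GB = 10**9


def full_state_bytes(n_params: int,
                     bytes_per_weight: int = 4,   # assumption: fp32 weights
                     optimizer_states: int = 2,   # assumption: Adam-style moments
                     bytes_per_state: int = 4) -> int:
    """Bytes needed for one unsharded copy of weights plus optimizer state."""
    return n_params * (bytes_per_weight + optimizer_states * bytes_per_state)


def host_bytes(n_params: int, copies: int) -> int:
    """Host memory if `copies` processes each hold a full unsharded state."""
    return copies * full_state_bytes(n_params)


if __name__ == "__main__":
    system_memory = 1300 * GB  # "1.3T" from the report, read as about 1.3 TB
    for name, n_params in [("mpt-125m", 125 * 10**6), ("mpt-30b", 30 * 10**9)]:
        one = full_state_bytes(n_params)
        worst = host_bytes(n_params, copies=8)  # assumption: one copy per GPU rank
        print(f"{name}: one copy {one / GB:.1f} GB, "
              f"8 copies {worst / GB:.1f} GB, "
              f"fits in host memory: {worst <= system_memory}")
```

Under these assumptions a single copy of the 30B state is about 360 GB, which fits in host memory, while eight simultaneous copies would not. That is consistent with the small model saving and the large one being killed, but it is an estimate, not a diagnosis.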

Jun 24 '23 01:06 bpucla
