
Cannot load the checkpoint

Open jmlongriver12 opened this issue 2 years ago • 5 comments

Describe the bug
The generate script raises an error when loading the 20B checkpoint.

To Reproduce
Steps to reproduce the behavior:

  1. run "./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt"
  2. The error below is raised:

Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Traceback (most recent call last):
  File "generate.py", line 91, in <module>
    main()
  File "generate.py", line 33, in main
    model, neox_args = setup_for_inference_or_eval(use_cache=True)
  File "/work/c272987/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
    model, _, _ = setup_model_and_optimizer(
  File "/work//gpt-neox/megatron/training.py", line 447, in setup_model_and_optimizer
    neox_args.iteration = load_checkpoint(
  File "/work//gpt-neox/megatron/checkpointing.py", line 239, in load_checkpoint
    checkpoint_name, state_dict = model.load_checkpoint(
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1558, in _load_checkpoint
    self.load_module_state_dict(state_dict=checkpoint['module'],
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1278, in load_module_state_dict
    self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 571, in load_state_dir
    layer.load_state_dict(torch.load(model_ckpt_path,
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 778, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
    main()
  File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

Expected behavior
The generation command runs smoothly.

Environment (please complete the following information):

  • GPUs: 4
  • Configs: 20B

jmlongriver12 · Feb 06 '23 16:02

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

When this message shows up, it usually implies that one of the checkpoint files is incomplete (e.g. truncated or corrupted during transfer). Can you check the local files?
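If it helps, here is a minimal sketch of one way to check them (the checkpoint directory is an assumption; point it at wherever the 20B shards were downloaded). torch.save writes each shard as a zip archive, so any file that fails the zip check, or is much smaller than its siblings, is a likely candidate to re-download:

```python
# check_ckpt.py -- rough sketch: flag checkpoint shards that are not valid zip archives.
import glob
import os
import zipfile

ckpt_dir = "./20B_checkpoints/global_step150000"  # hypothetical path, adjust to your setup

for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
    size_mib = os.path.getsize(path) / 2**20
    # torch.save produces a zip archive; a truncated download fails this check with
    # the same "failed finding central directory" error seen in the traceback.
    ok = zipfile.is_zipfile(path)
    print(f"{path}  {size_mib:9.1f} MiB  {'OK' if ok else 'CORRUPT/INCOMPLETE'}")
```

Comparing file sizes against a fresh download (or published checksums, if available) is another quick sanity check.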

syskn · Feb 11 '23 22:02

I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.70 GiB total capacity; 7.89 GiB already allocated; 39.19 MiB free; 7.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

cywjava · Mar 22 '23 00:03

I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.70 GiB total capacity; 7.89 GiB already allocated; 39.19 MiB free; 7.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How big is your GPU? You need a rather large GPU to load a 20B model, and it seems you simply don’t have enough VRAM.
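As a rough back-of-the-envelope (assuming fp16 weights, roughly 20.6B parameters, and ignoring activations and the KV cache), the weights alone already exceed a single 24 GiB card:

```python
# Rough estimate of weight memory for a ~20.6B-parameter model stored in fp16.
params = 20.6e9          # approximate parameter count (assumption)
bytes_per_param = 2      # fp16
print(f"~{params * bytes_per_param / 2**30:.0f} GiB for the weights alone")  # ~38 GiB
```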

StellaAthena · Mar 24 '23 04:03

Hi @StellaAthena, I'm trying to run inference and finetuning using 20B with 8 x NVIDIA A10G 23GB VRAM and still get the

RuntimeError: CUDA out of memory. Tried to allocate 9.59 GiB (GPU 0; 22.04 GiB total capacity; 14.39 GiB already allocated; 7.00 GiB free; 14.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

None of the below configs work:

"pipe-parallel-size": 8|4|2|1, "model-parallel-size": 1|2|4|8,

I'm running version 2.0 of GPT-NeoX. Do you have any tips on how to improve the config and be able to run it?
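For context, my rough accounting (assuming fp16 weights and gradients plus fp32 Adam state, split evenly over 8 GPUs, activations not included) already lands above 23 GiB per GPU for finetuning:

```python
# Rough per-GPU memory estimate for finetuning a ~20.6B model with Adam across 8 GPUs.
# Assumes 2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state
# (master weights, momentum, variance) per parameter; activations come on top of this.
params = 20.6e9
bytes_per_param = 2 + 2 + 12
total_gib = params * bytes_per_param / 2**30
print(f"total ~{total_gib:.0f} GiB, ~{total_gib / 8:.0f} GiB per GPU")  # ~307 GiB, ~38 GiB/GPU
```

So even with 8 GPUs, the optimizer states alone are likely too large for 23 GiB cards without some form of offloading.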

heukirne · Apr 05 '23 08:04

I was able to run it using the HF version https://github.com/mallorbc/GPTNeoX20B_HuggingFace

heukirne · Apr 06 '23 16:04