gpt-neox
Cannot load the checkpoint
Describe the bug
An error is raised when running the generate script.

To Reproduce
Steps to reproduce the behavior:
- run "./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt"
- the error is raised as below:
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Traceback (most recent call last):
File "generate.py", line 91, in
main() File "generate.py", line 33, in main model, neox_args = setup_for_inference_or_eval(use_cache=True) File "/work/c272987/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval model, _, _ = setup_model_and_optimizer( File "/work//gpt-neox/megatron/training.py", line 447, in setup_model_and_optimizer neox_args.iteration = load_checkpoint( File "/work//gpt-neox/megatron/checkpointing.py", line 239, in load_checkpoint checkpoint_name, state_dict = model.load_checkpoint( File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint load_path, client_states = self._load_checkpoint(load_dir, File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1558, in _load_checkpoint self.load_module_state_dict(state_dict=checkpoint['module'], File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1278, in load_module_state_dict self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict) File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 571, in load_state_dir layer.load_state_dict(torch.load(model_ckpt_path, File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 778, in load with _open_zipfile_reader(opened_file) as opened_zipfile: File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 282, in init super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer)) RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory Traceback (most recent call last): File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in main() File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
Expected behavior
The script runs smoothly.
Environment (please complete the following information):
- GPUs: 4
- Configs: 20B
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
When this message shows up, it usually implies that one of the checkpoint files is incomplete (e.g. broken during transfer). Can you check the local files?
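If it helps to narrow down which file is damaged, one quick check is to try deserializing every shard and report the failures. This is a minimal sketch, assuming the default slim-weights layout (the `./20B_checkpoints/global_step150000` path is an assumption; adjust it to your checkpoint directory):

```python
# Hypothetical integrity sweep: attempt to torch.load every .pt shard and
# print the ones that fail. A truncated download typically raises the same
# "failed finding central directory" error seen in the traceback above.
# Note: each shard is loaded fully onto the CPU, so this needs host RAM
# roughly equal to the largest shard.
import glob
import torch

for path in sorted(glob.glob("./20B_checkpoints/global_step150000/*.pt")):
    try:
        torch.load(path, map_location="cpu")
    except Exception as exc:
        print(f"corrupt or truncated: {path} ({exc})")
```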
I get this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.70 GiB total capacity; 7.89 GiB already allocated; 39.19 MiB free; 7.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
How big is your GPU? You need a rather large GPU to load a 20B model, and it seems you simply don’t have enough VRAM.
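As a rough sanity check (my own back-of-envelope figure, not from the reply above): in fp16 the weights alone of a 20B-parameter model take about 37 GiB, already more than the 23.70 GiB card in the error message, before counting activations or cache:

```python
# Back-of-envelope fp16 footprint of the weights alone; activations,
# KV cache, and framework overhead all come on top of this figure.
params = 20e9
print(f"{params * 2 / 2**30:.1f} GiB")  # 2 bytes/param in fp16 -> ~37.3 GiB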
Hi @StellaAthena, I'm trying to run inference and fine-tuning using 20B with 8 x NVIDIA A10G GPUs (23 GB VRAM each) and still get the
RuntimeError: CUDA out of memory. Tried to allocate 9.59 GiB (GPU 0; 22.04 GiB total capacity; 14.39 GiB already allocated; 7.00 GiB free; 14.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
None of the below configs work:
"pipe-parallel-size": 8|4|2|1, "model-parallel-size": 1|2|4|8,
I'm running version 2.0 of GPT-NeoX. Do you have any tips on how to improve the config so I can run it?
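A quick sketch (mine, and only an optimistic lower bound) of the per-GPU weight share under pure model parallelism suggests that at model-parallel-size 8 the weights themselves fit easily, so the OOM above is more likely driven by activations and duplicated state than by the weights:

```python
# Per-GPU share of the fp16 weights under tensor/model parallelism alone;
# this ignores activations, KV cache, and any gradient/optimizer state,
# so real usage is substantially higher (especially when fine-tuning).
total_weight_gib = 20e9 * 2 / 2**30  # ~37 GiB
for mp in (1, 2, 4, 8):
    print(f"model-parallel-size={mp}: {total_weight_gib / mp:.1f} GiB of weights per GPU")
```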
I was able to run it using the HF version: https://github.com/mallorbc/GPTNeoX20B_HuggingFace
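For reference, a minimal sketch of that HF route, assuming the official EleutherAI/gpt-neox-20b weights on the Hub and the `accelerate` package installed (this is not the exact code from the linked repo):

```python
# Load the 20B model via HuggingFace transformers; device_map="auto"
# lets accelerate shard the layers across the available GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    device_map="auto",   # requires `accelerate`; spreads layers over GPUs
    torch_dtype="auto",  # keep the checkpoint's fp16 dtype
)
```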