
An error occurred while running llama-2-7b

123mtr-A opened this issue 1 year ago · 1 comment

## Describe the bug

When I try to run the llama-2-7b model via

```
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
```

I encounter the following error message:

```
Traceback (most recent call last):
  File "/home/ai02/llama/Projecct/example_text_completion.py", line 11, in <module>
    checkpoint = torch.load('/home/ai02/llama/Projecct/llama-2-7b/checklist.chk', map_location='gpu')
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: could not find MARK
[2023-11-02 18:39:59,543] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 90675) of binary: /home/ai02/anaconda3/envs/llama/bin/python
Traceback (most recent call last):
  File "/home/ai02/anaconda3/envs/llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
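Editor's note: judging from the first frame of the traceback, line 11 of `example_text_completion.py` was modified to call `torch.load` directly on `checklist.chk` with `map_location='gpu'`. Two things look off here (my reading, not confirmed by the reporter): `checklist.chk` appears to be a plain-text checksum manifest rather than a pickled checkpoint (the weights live in files like `consolidated.00.pth`), and `'gpu'` is not a valid `map_location` string (`'cpu'` or `'cuda'` are). A minimal stdlib sketch of why unpickling a text file fails with exactly this error, using a made-up checksum line:

```python
import pickle

# Dummy stand-in for checklist.chk: a plain-text checksum line
# (assumed format "<md5>  <filename>"), not pickled data.
fake_chk = b"d41d8cd98f00b204e9800998ecf8427e  consolidated.00.pth\n"

try:
    # torch.load's legacy path hands the file to the pickle machinery,
    # which misreads the first text byte as a pickle opcode and fails.
    checkpoint = pickle.loads(fake_chk)
    msg = None
except pickle.UnpicklingError as exc:
    msg = str(exc)
    print("UnpicklingError:", msg)
```

The actual wording of the error depends on which opcode the first byte of the file happens to decode as; the underlying cause, a non-pickle file fed to the unpickler, is the same either way.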

```
example_text_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-11-02_18:39:59
  host       : ai02-PR4910P
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 90675)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
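Editor's note: for reference, `checklist.chk` in the Llama download appears to be an MD5 manifest intended for `md5sum -c` verification of the checkpoint files, not something to load as a checkpoint. A sketch with dummy files in place of the real download (file names and manifest format are assumptions based on the paths in the traceback):

```shell
# Build a stand-in download directory with a dummy weights file,
# generate a checklist.chk-style manifest for it, then verify it
# the way such a manifest is meant to be consumed.
mkdir -p /tmp/llama2_demo
cd /tmp/llama2_demo
echo "dummy weights" > consolidated.00.pth
md5sum consolidated.00.pth > checklist.chk
md5sum -c checklist.chk
```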

## Runtime Environment

- Model: llama-2-7b
- Using via huggingface?: [yes/no]
- OS: [eg. Ubuntu]
- GPU VRAM: A100
- Number of GPUs: 5
- GPU Make: Nvidia

123mtr-A avatar Nov 02 '23 10:11 123mtr-A

I'm not sure what the error is; please paste the full stack trace. If you made any modifications to the script, include those changes as well.

Also, please adhere to the formatting in the issue template as that helps us understand your issue faster.

subramen avatar Nov 08 '23 13:11 subramen