An error occurred while running llama-2-7b
## Describe the bug
When I try to run the llama-2-7b model with

```
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4
```

I encounter the following error message:
```
Traceback (most recent call last):
  File "/home/ai02/llama/Projecct/example_text_completion.py", line 11, in <module>
    checkpoint = torch.load('/home/ai02/llama/Projecct/llama-2-7b/checklist.chk', map_location='gpu')
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/serialization.py", line 1246, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: could not find MARK
[2023-11-02 18:39:59,543] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 90675) of binary: /home/ai02/anaconda3/envs/llama/bin/python
Traceback (most recent call last):
  File "/home/ai02/anaconda3/envs/llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ai02/anaconda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
------------------------------------------------------------
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-02_18:39:59
  host      : ai02-PR4910P
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 90675)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
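Two details in the traceback above look suspicious: `map_location='gpu'` is not a device string PyTorch accepts (it expects `'cpu'`, `'cuda'`, or a `torch.device`), and `checklist.chk` is the checksum file shipped with the download, not a pickled checkpoint, which is why `torch.load` fails with `could not find MARK`. A minimal sketch of the intended pattern, using a tiny state dict saved locally for illustration (the real weights file name, e.g. `llama-2-7b/consolidated.00.pth`, is an assumption based on the standard llama-2-7b download layout):

```python
import torch

# Illustrative only: save a small state dict and reload it with a valid
# map_location. For the real model, point torch.load at the .pth weights
# file (e.g. consolidated.00.pth), never at the checklist.chk checksum file.
state = {"weight": torch.zeros(2, 2)}
torch.save(state, "demo.pth")

# 'cpu' always works; pass 'cuda' (not 'gpu') to place tensors on the GPU.
loaded = torch.load("demo.pth", map_location="cpu")
print(sorted(loaded.keys()))
```

Passing `map_location="gpu"` to the same call raises a `RuntimeError` about an invalid device string, which is a separate failure from the unpickling error caused by reading the wrong file.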
## Runtime Environment
- Model: llama-2-7b
- Using via huggingface?: [yes/no]
- OS: [e.g. Ubuntu]
- GPU Make: Nvidia
- GPU Model: A100
- Number of GPUs: 5
I'm not sure what the error is; please paste the full stack trace. If you made any modifications to the script, include the changes you made.
Also, please adhere to the formatting in the issue template, as that helps us understand your issue faster.