Can't run inference
torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B/ --tokenizer_path $TARGET_FOLDER/tokenizer.model
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/shashank/llama/llama/example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/shashank/llama/llama/example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "/home/shashank/llama/llama/example.py", line 35, in load
    assert (
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11162) of binary: /home/shashank/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/shashank/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-03-03_04:27:49
  host      : tony
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11162)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I have the same issue. The key line, I assume, is "AssertionError: Loading a checkpoint for MP=0 but world size is 1".
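For reference, that assertion comes from load() in example.py, which counts the checkpoint shards it finds in --ckpt_dir and compares the count with the world size. Roughly like this (a sketch, not the exact source; the *.pth glob is an assumption based on the consolidated shard files in the download):

# Sketch of the check inside example.py's load()
from pathlib import Path

def load(ckpt_dir: str, world_size: int):
    # MP = number of checkpoint shards found in ckpt_dir
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
    )
    # ... the shard for this rank is then loaded ...

So MP=0 just means the glob found zero checkpoint files, i.e. whatever was passed as --ckpt_dir contains no *.pth shards.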
I don't know if changing --nproc_per_node to 0 even makes sense (doesn't that mean you have no GPU?), but trying it just gives:
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 215, in launch_agent
spec = WorkerSpec(
File "<string>", line 15, in __init__
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
assert self.local_world_size > 0
AssertionError
Okay, it looks like this is caused by a missing tokenizer.model file, which doesn't seem to get downloaded with the 7B weights for some reason? I pinched it from the 65B model and that ran okay. Well, I ran out of memory almost instantly, but it looked like it was actually trying to run.
Edit: So to be clear, this issue means you are not correctly pointing at your model folder or, way more likely, at your tokenizer.model file.
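For anyone else landing here, a quick way to see what example.py will actually find in the paths you pass it (just a sketch; the *.pth shard naming is an assumption based on the downloaded weights):

# Sketch: list what the checkpoint folder contains before launching torchrun
import sys
from pathlib import Path

ckpt_dir = Path(sys.argv[1])        # e.g. path/to/7B
tokenizer_path = Path(sys.argv[2])  # e.g. path/to/tokenizer.model

shards = sorted(ckpt_dir.glob("*.pth"))
print("*.pth shards in", ckpt_dir, ":", [p.name for p in shards])
print("tokenizer.model exists:", tokenizer_path.is_file())

If the shard list comes back empty you are in the MP=0 case above; for the 7B weights there should be exactly one shard.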
I guess you didn't export TARGET_FOLDER, so no "*.pth" checkpoint files were detected.
- try Python 3.10
- change $TARGET_FOLDER in your shell command to the folder where you downloaded the models, for example: ./models/7B/

These two are the obvious issues in your report.
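On the TARGET_FOLDER point: if it isn't exported, the shell expands $TARGET_FOLDER/7B/ to just /7B/, so example.py globs an empty (or nonexistent) directory and reports MP=0. A quick check you can run from the same shell (sketch only):

# Sketch: print what the torchrun arguments will expand to with the current environment
import os

target = os.environ.get("TARGET_FOLDER", "")
print("TARGET_FOLDER =", repr(target))
print("--ckpt_dir would expand to:", target + "/7B/")
print("--tokenizer_path would expand to:", target + "/tokenizer.model")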