
Can't run inference

Open · shashankyld opened this issue 1 year ago · 4 comments

torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B/ --tokenizer_path $TARGET_FOLDER/tokenizer.model

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/shashank/llama/llama/example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/shashank/llama/llama/example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "/home/shashank/llama/llama/example.py", line 35, in load
    assert (
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11162) of binary: /home/shashank/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/shashank/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/shashank/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-03-03_04:27:49
  host      : tony
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11162)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

shashankyld · Mar 02 '23, 22:03

I have the same issue; the key line, I assume, is "AssertionError: Loading a checkpoint for MP=0 but world size is 1".

I don't know if changing this to 0 even makes sense (doesn't that mean you have no GPU?), but trying it just gives:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 215, in launch_agent
    spec = WorkerSpec(
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 87, in __post_init__
    assert self.local_world_size > 0
AssertionError

Urammar · Mar 03 '23, 00:03

Okay, it looks like this is caused by a missing tokenizer.model file, which doesn't seem to get downloaded with the 7B weights for some reason. I pinched it from the 65B download and that ran okay. Well, I ran out of memory almost instantly, but it did look like it was actually trying to run.

Edit: To be clear, this error means you are not pointing correctly at your model folder or, more likely, at your tokenizer.model file.
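
For anyone hitting this: load() in example.py apparently just counts the *.pth checkpoint shards in --ckpt_dir and asserts that the count (the MP number) matches the world size, so "MP=0" means it found no checkpoint files at the path you passed. A quick sanity check before launching (the file names are what the official download script is expected to produce; adjust $TARGET_FOLDER to your own download location):

    ls $TARGET_FOLDER/7B/                # expect consolidated.00.pth, params.json, checklist.chk
    ls $TARGET_FOLDER/tokenizer.model    # must exist next to the per-model folders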

Urammar · Mar 03 '23, 00:03

I guess you didn't export TARGET_FOLDER, so no "*.pth" checkpoint files were detected.
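
If so, exporting it before re-running should fix the detection; the path below is only a placeholder for wherever you saved the weights:

    export TARGET_FOLDER=/path/to/llama-weights    # placeholder; use your actual download directory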

27182812 · Mar 03 '23, 06:03

  1. Try Python 3.10.
  2. Change $TARGET_FOLDER in your shell command to the folder where you downloaded the models, e.g. ./models/ (so that --ckpt_dir resolves to ./models/7B/); see the example after this list.

These are two obvious issues in your report.
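
For example, if the weights were downloaded to ./models/ (an assumed location; substitute your own), the invocation from the report becomes:

    TARGET_FOLDER=./models
    torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B/ --tokenizer_path $TARGET_FOLDER/tokenizer.model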

tsaijamey · Mar 03 '23, 09:03