llama
AssertionError: Loading a checkpoint for MP=1 but world size is None
Hello, I am getting this error.
My command is:
torchrun --nproc_per_node 1 test.py --file_name=stopwords.txt --output_name=ttu1.csv --ckpt_dir=llama-2-7b --tokenizer_path=tokenizer.model
and this is the result:
Traceback (most recent call last):
File "test.py", line 89, in
@teriterance Can you provide more details? For example, can you run nvidia-smi?
I'm having the same issue. Are there any updates on this?
@ashleylew Can you provide more details about your system? For example, can you run nvidia-smi?
@teriterance @EmanuelaBoros @ashleylew This worked for me: try passing the model_parallel_size parameter to Llama.build:
import os

if distributed_training:
    # torchrun already exports RANK and WORLD_SIZE for each worker it launches.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = ""  # Set your own master address
    os.environ["MASTER_PORT"] = ""  # Set your own master port

model_parallel_size = int(os.environ.get("WORLD_SIZE", "1"))
generator = Llama.build(ckpt_dir=ckpt_dir, tokenizer_path=tokenizer_path, max_seq_len=max_seq_len, max_batch_size=max_batch_size, model_parallel_size=model_parallel_size)
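If it helps, here is a minimal standalone sketch of the same idea that you can run on its own (assuming the llama package from this repo is installed and the llama-2-7b / tokenizer.model paths from the command above exist; the script name and the max_seq_len / max_batch_size values are just placeholders). torchrun exports RANK and WORLD_SIZE for every worker, so forwarding WORLD_SIZE as model_parallel_size keeps it consistent with the number of checkpoint shards and avoids the assertion:

# test_build.py -- run with: torchrun --nproc_per_node 1 test_build.py
import os

from llama import Llama

def main():
    # torchrun sets WORLD_SIZE (and RANK) for every worker it launches.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    generator = Llama.build(
        ckpt_dir="llama-2-7b",           # folder containing the consolidated.*.pth checkpoint shards
        tokenizer_path="tokenizer.model",
        max_seq_len=512,                 # placeholder values
        max_batch_size=4,
        model_parallel_size=world_size,  # must match the number of shards (1 for llama-2-7b)
    )
    print(f"Model built with model_parallel_size={world_size}")

if __name__ == "__main__":
    main()

Since the 7B checkpoint is a single shard (MP=1), --nproc_per_node must be 1; larger models with more shards need WORLD_SIZE equal to the shard count.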