llama
AssertionError: Loading a checkpoint for MP=1 but world size is None
Hello, I am getting this error.
My command is:
torchrun --nproc_per_node 1 test.py --file_name=stopwords.txt --output_name=ttu1.csv --ckpt_dir=llama-2-7b --tokenizer_path=tokenizer.model
and this is the result:
Traceback (most recent call last):
File "test.py", line 89, in
@teriterance Can you provide more details? For example, can you run nvidia-smi?
I'm having the same issue. Are there any updates on this?
@ashleylew Can you provide more details about your system? For example, can you run nvidia-smi?
@teriterance @EmanuelaBoros @ashleylew This worked for me: try passing the model_parallel_size parameter to Llama.build:
import os

if distributed_training:
    # torchrun already exports RANK and WORLD_SIZE for each worker it launches.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = ""  # Set your own master address
    os.environ["MASTER_PORT"] = ""  # Set your own master port

model_parallel_size = int(os.environ.get("WORLD_SIZE", "1"))
generator = Llama.build(ckpt_dir=ckpt_dir, tokenizer_path=tokenizer_path, max_seq_len=max_seq_len, max_batch_size=max_batch_size, model_parallel_size=model_parallel_size)
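If it helps, here is a minimal standalone sketch of the same idea that you can run on its own (assuming the llama package from this repo is installed and the llama-2-7b / tokenizer.model paths from the command above exist; the script name and the max_seq_len / max_batch_size values are just placeholders). torchrun exports RANK and WORLD_SIZE for every worker, so forwarding WORLD_SIZE as model_parallel_size keeps it consistent with the number of checkpoint shards and avoids the assertion:

# test_build.py -- run with: torchrun --nproc_per_node 1 test_build.py
import os

from llama import Llama

def main():
    # torchrun sets WORLD_SIZE (and RANK) for every worker it launches.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    generator = Llama.build(
        ckpt_dir="llama-2-7b",           # folder containing the consolidated.*.pth checkpoint shards
        tokenizer_path="tokenizer.model",
        max_seq_len=512,                 # placeholder values
        max_batch_size=4,
        model_parallel_size=world_size,  # must match the number of shards (1 for llama-2-7b)
    )
    print(f"Model built with model_parallel_size={world_size}")

if __name__ == "__main__":
    main()

Since the 7B checkpoint is a single shard (MP=1), --nproc_per_node must be 1; larger models with more shards need WORLD_SIZE equal to the shard count.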