llama-stack
Checkpoint Cannot Be Found For Llama 405B Model
I am trying to run inference with the FP8 version of the Llama 3.1 405B model (Meta-Llama3.1-405B-Instruct). The model was downloaded with:
llama download --source huggingface --model-id Meta-Llama3.1-405B-Instruct --hf-token TOKEN
However, the command
llama distribution start --name local-llama-405b --port 5000 --disable-ipv6
gave the following error:
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-25_04:55:10
host : node007
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3088723)
error_file: /tmp/torchelastic_kvox4nb5/ee89349c-cc4c-43c4-9796-1ceeb2986a3b_ugr5p160/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/home/ubuntu/miniforge3/envs/local-llama-405b/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/parallel_utils.py", line 131, in worker_process_entrypoint
model = init_model_cb()
File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/model_parallel.py", line 48, in init_model_cb
llama = Llama.build(config)
File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/generation.py", line 100, in build
assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
AssertionError: no checkpoint files found in /home/ubuntu/.llama/checkpoints/Meta-Llama3.1-405B-Instruct/original
The contents of the original folder are:
ubuntu@node007:/mnt/nfs_share/taoz/.llama/checkpoints/Meta-Llama3.1-405B-Instruct/original$ ls
consolidated.00 consolidated.02 consolidated.04 consolidated.06 fp8_scales_0.pt fp8_scales_2.pt fp8_scales_4.pt fp8_scales_6.pt params.json tokenizer.model
consolidated.01 consolidated.03 consolidated.05 consolidated.07 fp8_scales_1.pt fp8_scales_3.pt fp8_scales_5.pt fp8_scales_7.pt README.md
The consolidated.xx entries are folders instead of files, which I think is probably why no checkpoint files were found.
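To check this guess, a small script can summarize what a loader would see in the checkpoint folder. This is only a sketch: it assumes the loader looks for regular files ending in .pth directly inside the folder (I have not confirmed that this is what generation.py does), and inspect_checkpoint_dir is a hypothetical helper name, not part of llama-stack.

```python
from pathlib import Path


def inspect_checkpoint_dir(ckpt_dir):
    """Classify the entries directly inside a checkpoint directory.

    Assumption: the loader only picks up regular files with a .pth suffix,
    so directories named consolidated.NN would be skipped.
    """
    root = Path(ckpt_dir)
    entries = sorted(root.iterdir())
    return {
        # Regular files that a "*.pth" file search would match.
        "pth_files": [p.name for p in entries if p.is_file() and p.suffix == ".pth"],
        # Sub-directories, e.g. consolidated.00 if it is a folder.
        "directories": [p.name for p in entries if p.is_dir()],
        # Everything else (params.json, tokenizer.model, ...).
        "other_files": [p.name for p in entries if p.is_file() and p.suffix != ".pth"],
    }
```

Calling inspect_checkpoint_dir on the original folder and printing the result should show whether the consolidated.xx entries land under "directories" while "pth_files" stays empty, which would match the assertion in the traceback.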