Checkpoint Cannot Be Found For Llama 405B Model

Open dawenxi-007 opened this issue 6 months ago • 1 comments

I am trying to run inference with the FP8 version of the Llama 3.1 405B model (Meta-Llama3.1-405B-Instruct). The model was downloaded with llama download --source huggingface --model-id Meta-Llama3.1-405B-Instruct --hf-token TOKEN. However, running llama distribution start --name local-llama-405b --port 5000 --disable-ipv6 fails with the following error:

------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-25_04:55:10
  host      : node007
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3088723)
  error_file: /tmp/torchelastic_kvox4nb5/ee89349c-cc4c-43c4-9796-1ceeb2986a3b_ugr5p160/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/home/ubuntu/miniforge3/envs/local-llama-405b/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/parallel_utils.py", line 131, in worker_process_entrypoint
      model = init_model_cb()
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/model_parallel.py", line 48, in init_model_cb
      llama = Llama.build(config)
    File "/home/ubuntu/taoz/llama-stack/llama_toolchain/inference/meta_reference/generation.py", line 100, in build
      assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
  AssertionError: no checkpoint files found in /home/ubuntu/.llama/checkpoints/Meta-Llama3.1-405B-Instruct/original

The original folder contains the following:

ubuntu@node007:/mnt/nfs_share/taoz/.llama/checkpoints/Meta-Llama3.1-405B-Instruct/original$ ls
consolidated.00  consolidated.02  consolidated.04  consolidated.06  fp8_scales_0.pt  fp8_scales_2.pt  fp8_scales_4.pt  fp8_scales_6.pt  params.json  tokenizer.model
consolidated.01  consolidated.03  consolidated.05  consolidated.07  fp8_scales_1.pt  fp8_scales_3.pt  fp8_scales_5.pt  fp8_scales_7.pt  README.md

The consolidated.xx entries are folders rather than files, which I think is probably why no checkpoints were found.
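To check this guess, a small diagnostic along the following lines could help. It is only a sketch: the traceback shows the assertion but not how `checkpoints` is collected, so the `*.pth` glob pattern is an assumption, and `inspect_checkpoint_dir` is a hypothetical helper name, not part of llama-stack.

```python
from fnmatch import fnmatch
from pathlib import Path


def inspect_checkpoint_dir(ckpt_dir, pattern="*.pth"):
    """Summarize what a glob-based checkpoint loader would see in ckpt_dir.

    The default pattern "*.pth" is an ASSUMPTION about the loader; the
    traceback only shows that the resulting list of checkpoints was empty.
    """
    root = Path(ckpt_dir)
    entries = sorted(root.iterdir()) if root.is_dir() else []
    return {
        # False if the path the loader was given does not exist at all.
        "exists": root.is_dir(),
        # Regular files whose names match the assumed loader pattern.
        "matching_files": [
            p.name for p in entries if p.is_file() and fnmatch(p.name, pattern)
        ],
        # consolidated.* entries that are directories rather than files.
        "consolidated_dirs": [
            p.name
            for p in entries
            if p.is_dir() and p.name.startswith("consolidated.")
        ],
    }
```

Running `inspect_checkpoint_dir` on the directory named in the AssertionError would show whether that path exists on the node, whether any files match the assumed pattern, and which consolidated.xx entries are directories.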

dawenxi-007, Aug 28 '24 06:08