Unable to load model

Open ImagineMiracle-wxn opened this issue 10 months ago • 3 comments

System: Ubuntu 24.10. I executed the command abc and got this debug log. The web UI always displays 'Checking download status...'.

Received request: GET /v1/download/progress
Received request: GET /v1/download/progress
Download error on attempt 5/30 for repo_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated')
Traceback (most recent call last):
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 164, in _download_file
    raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}")
Exception: Downloaded file /tmp/exo/mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated/model.safetensors.index.json has hash 0fd8120f1c6acddc268ebc2583058efaf699a771 but remote hash is 0fd8120f1c6acddc268ebc2583058efaf699a771-gzip
Received request: GET /v1/download/progress
Received request: GET /v1/download/progress
update_peers: added=[] removed=[] updated=[] unchanged=[<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x747f587c8f20>] to_disconnect=[] to_connect=[] did_peers_change=False
Collecting topology max_depth=4 visited=set()
Collected topology from: 7abb6259-9497-473e-8964-212f353004e9: Topology(Nodes: {7abb6259-9497-473e-8964-212f353004e9: Model: Linux Box (Device: CLANG). Chip: Unknown Chip (Device: CLANG). Memory: 64354MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS}, Edges: {})
Received request: GET /v1/topology
Download error on attempt 8/30 for repo_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated')
Traceback (most recent call last):
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 164, in _download_file
    raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}")
Exception: Downloaded file /tmp/exo/mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated/model.safetensors.index.json has hash 0fd8120f1c6acddc268ebc2583058efaf699a771 but remote hash is 0fd8120f1c6acddc268ebc2583058efaf699a771-gzip
Received request: GET /v1/download/progress
Download error on attempt 8/30 for repo_id='unsloth/Llama-3.3-70B-Instruct' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/unsloth--Llama-3.3-70B-Instruct')
Traceback (most recent call last):
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 164, in _download_file
    raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}")
Exception: Downloaded file /tmp/exo/unsloth--Llama-3.3-70B-Instruct/model.safetensors.index.json has hash 37b1afe63cadc4ddce30aaff1b149c2f3083650c but remote hash is 37b1afe63cadc4ddce30aaff1b149c2f3083650c-gzip
Download error on attempt 8/30 for repo_id='TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-70B-R' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/TriAiExperiments--SFR-Iterative-DPO-LLaMA-3-70B-R')
Traceback (most recent call last):
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 155, in _download_file
    assert r.status in [200, 206], f"Failed to download {path} from {url}: {r.status}"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Failed to download model.safetensors.index.json from https://hf-mirror.com/TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-70B-R/resolve/main/model.safetensors.index.json: 401
Download error on attempt 8/30 for repo_id='NousResearch/Meta-Llama-3.1-70B-Instruct' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/NousResearch--Meta-Llama-3.1-70B-Instruct')
Traceback (most recent call last):
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/imaginemiracle/Downloads/exo/exo/download/new_shard_download.py", line 164, in _download_file
    raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}")
Exception: Downloaded file /tmp/exo/NousResearch--Meta-Llama-3.1-70B-Instruct/model.safetensors.index.json has hash 37b1afe63cadc4ddce30aaff1b149c2f3083650c but remote hash is 37b1afe63cadc4ddce30aaff1b149c2f3083650c-gzip

ImagineMiracle-wxn avatar Feb 26 '25 02:02 ImagineMiracle-wxn
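
The hash errors in the log above share one pattern: the local hash and the remote hash are identical except for a trailing -gzip on the remote value, a suffix that some servers and proxies append to an ETag when they compress the response. Below is a minimal sketch of a comparison that tolerates such decorations, assuming the remote hash is taken from an ETag header. The names normalize_etag and hashes_match are hypothetical, and this is not necessarily how exo's own fix works.

```python
def normalize_etag(etag: str) -> str:
    """Strip decorations from an ETag so only the bare hash remains."""
    value = etag.strip()
    # Weak validators are prefixed with W/ (RFC 9110).
    if value.startswith("W/"):
        value = value[2:]
    # ETags are normally wrapped in double quotes.
    value = value.strip('"')
    # Assumption: a trailing "-gzip" only marks a compressed response.
    if value.endswith("-gzip"):
        value = value[: -len("-gzip")]
    return value


def hashes_match(local_hash: str, remote_etag: str) -> bool:
    """Compare a locally computed hash with a possibly decorated ETag."""
    return local_hash.lower() == normalize_etag(remote_etag).lower()


local = "0fd8120f1c6acddc268ebc2583058efaf699a771"
print(hashes_match(local, local + "-gzip"))      # True
print(hashes_match(local, 'W/"' + local + '"'))  # True
print(hashes_match(local, "37b1afe6"))           # False
```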

The same thing happened in my case, and I solved it!

After running exo, I copied the messages appearing in the terminal into Claude Sonnet to ask about it. This is part of the terminal message:

                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<7 lines>...
)
^

  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/aiohttp/connector.py", line 1118, in _wrap_create_connection
    raise ClientConnectorCertificateError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host huggingface.co:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1028)')]
Download error on attempt 10/30 for repo_id='mlx-community/Mistral-Nemo-Instruct-2407-4bit' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/var/folders/36/zn5d63wx1dj7vm3tpf4hxlpw0000gn/T/exo/mlx-community--Mistral-Nemo-Instruct-2407-4bit')
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/aiohttp/connector.py", line 1116, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs, sock=sock)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

and this was the answer I found:

I can see you're running into SSL certificate verification issues on macOS. This is a common problem when Python can't find the necessary certificates to verify the connection to Hugging Face. Looking at the error, SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1028), this is exactly the issue mentioned in the "Known Issues" section of the exo documentation. Here's how to fix it:

Run the macOS certificate installation command:

/Applications/Python\ 3.13/Install\ Certificates.command

(Adjust the Python version if needed to match your installation)

After installing the certificates, try running exo again.

Have fun!!

locoboy76 avatar Feb 26 '25 17:02 locoboy76
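
The "unable to get local issuer certificate" error above usually means that Python's ssl module found no usable CA bundle. The following diagnostic sketch uses only the standard library to print where this interpreter looks for CA certificates and whether those locations exist. The helper name describe_ca_paths is hypothetical.

```python
import os
import ssl


def describe_ca_paths() -> dict:
    """Report the default CA file and directory for this Python and whether they exist."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for label, location in (
        ("cafile", paths.openssl_cafile),
        ("capath", paths.openssl_capath),
    ):
        report[label] = {
            "path": location,
            "exists": bool(location) and os.path.exists(location),
        }
    return report


for label, info in describe_ca_paths().items():
    print(f"{label}: {info['path']} (exists: {info['exists']})")
```

If neither location exists on a python.org macOS install, running the Install Certificates.command step above is the likely fix.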

What commit are you running on? Should be fixed with https://github.com/exo-explore/exo/commit/af734f1bf6cca5c13abf934391b2474093723e1b

AlexCheema avatar Feb 26 '25 20:02 AlexCheema

@locoboy76 This is a different error from the OP's. It's covered in the troubleshooting section of the README.

AlexCheema avatar Feb 26 '25 21:02 AlexCheema

I have the same issue... I thought I had somehow kicked off a 70B download and was trying to figure out how to cancel it 🤣. Turns out the root cause is that the models in question no longer exist.

The downloader reports a 401:

Download error on attempt 0/30 for repo_id='TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-70B-R' revision='main' path='model.safetensors.index.json' target_dir=PosixPath('/tmp/exo/TriAiExperiments--SFR-Iterative-DPO-LLaMA-3-70B-R')
Traceback (most recent call last):
  File "/nix/store/xh5i8j6kpa7i37yhf10kzwvxxnnk822m-exo-0.15.0-alpha/lib/python3.12/site-packages/exo/download/new_shard_download.py", line 134, in download_file_with_retry
    try: return await _download_file(repo_id, revision, path, target_dir, on_progress)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/xh5i8j6kpa7i37yhf10kzwvxxnnk822m-exo-0.15.0-alpha/lib/python3.12/site-packages/exo/download/new_shard_download.py", line 156, in _download_file
    assert r.status in [200, 206], f"Failed to download {path} from {url}: {r.status}"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Failed to download model.safetensors.index.json from https://huggingface.co/TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-70B-R/resolve/main/model.safetensors.index.json: 401

And if you go to the page (https://huggingface.co/TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-70B-R), you'll see that the model is gone.

The downloader could handle that scenario a bit better.

deftdawg avatar Feb 28 '25 05:02 deftdawg
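
The log above shows the 401 going through the same N/30 retry loop as any other failure. One possible approach is to treat statuses such as 401, 403, and 404 as permanent and stop retrying immediately. This is only a sketch: the names DownloadFailed, is_retryable, and fetch_with_retry are hypothetical and are not taken from exo's code.

```python
import asyncio

# Assumption: these statuses will not succeed on a later attempt.
PERMANENT_STATUSES = {401, 403, 404}


class DownloadFailed(Exception):
    """Raised by a fetch callable when the server returns an error status."""

    def __init__(self, url: str, status: int):
        super().__init__(f"Failed to download {url}: {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    return status not in PERMANENT_STATUSES


async def fetch_with_retry(fetch, url: str, attempts: int = 30, delay: float = 0.0):
    """Call fetch(url) until it succeeds, giving up early on permanent errors."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except DownloadFailed as e:
            # Re-raise at once for permanent errors or on the final attempt.
            if not is_retryable(e.status) or attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)


async def demo():
    calls = 0

    async def always_401(url):
        nonlocal calls
        calls += 1
        raise DownloadFailed(url, 401)

    try:
        await fetch_with_retry(always_401, "https://example.invalid/model.safetensors.index.json")
    except DownloadFailed as e:
        return calls, e.status


print(asyncio.run(demo()))  # (1, 401)
```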

Thank you @locoboy76 for providing one solution. However, I am using conda for the exo venv, and certifi.where() shows exactly where cacert.pem is, so certifi is installed correctly in both the base environment and the venv. My case is also a bit different: I have been using the Hugging Face mirror and hit the same problem as the OP in _download_file: raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}"). Please help.

xuanzhec avatar Feb 28 '25 09:02 xuanzhec

How would you suggest we handle a repo being deleted by the owner? The downloader handles it by ignoring it and logging an error.

AlexCheema avatar Feb 28 '25 22:02 AlexCheema

A catch block that puts a little red ! next to the model status in the UI?

If I get a chance I'll take a crack at a patch after I get tinygrad rebuilt w/ ROCm support on NixOS and I'm finally able to do GPU inferencing on my 6900XT.

deftdawg avatar Feb 28 '25 22:02 deftdawg
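
The suggestion above could be sketched as a wrapper that catches the download error and records a per-model status, which a UI could then render as a red "!". All names here (ModelStatus, download_or_mark_failed, the statuses dict) are hypothetical and do not reflect exo's actual API.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional


@dataclass
class ModelStatus:
    state: str                    # "ready" or "error"
    detail: Optional[str] = None  # message a UI could show next to a red "!"


async def download_or_mark_failed(
    repo_id: str,
    download: Callable[[str], Awaitable[None]],
    statuses: Dict[str, ModelStatus],
) -> None:
    """Run the download and record the outcome instead of only logging it."""
    try:
        await download(repo_id)
        statuses[repo_id] = ModelStatus("ready")
    except Exception as e:
        # Keep the failure visible so the UI can flag this model.
        statuses[repo_id] = ModelStatus("error", str(e))


async def demo() -> Dict[str, ModelStatus]:
    async def missing_repo(repo_id: str) -> None:
        # Stand-in for a download that fails because the repo is gone.
        raise RuntimeError(f"{repo_id}: 401")

    statuses: Dict[str, ModelStatus] = {}
    await download_or_mark_failed("example/deleted-model", missing_repo, statuses)
    return statuses


print(asyncio.run(demo()))
```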

Awesome - would be great, thank you!

AlexCheema avatar Feb 28 '25 22:02 AlexCheema

@AlexCheema Hi Alex, I am curious how I should handle the hash mismatch on Mac that is raised in _download_file: raise Exception(f"Downloaded file {target_dir/path} has hash {final_hash} but remote hash is {remote_hash}"). As mentioned, I am using conda as the base, certifi is installed in the .venv for exo, and certifi.where() returns '~/exo/.venv/lib/python3.12/site-packages/certifi/cacert.pem'. pip also reports: Requirement already satisfied: certifi in ./.venv/lib/python3.12/site-packages (2025.1.31). Thank you in advance.

xuanzhec avatar Mar 03 '25 01:03 xuanzhec