download_file is corrupting the file if downloaded_size is not equal to total_size

Open NebulaRover77 opened this issue 1 year ago • 2 comments

I was getting errors like "Error processing prompt: Invalid control character at: line 563 column 62 (char 47769)" while running exo. While investigating the problem, I came across a corrupted model.safetensors.index.json file:

% cat /Volumes/Palladium/cache/huggingface/hub/models--mlx-community--DeepSeek-R1-Distill-Qwen-1.5B-3bit/snapshots/17131b28b8190d7b2aceb9b98926f263606a11ee/model.safetensors.index.json | tail -n 10
    "model.layers.9.self_attn.q_proj.scales": "model.safetensors",
    "model.layers.9.self_attn.q_proj.weight": "model.safetensors",
    "model.layers.9.self_attn.v_proj.bias": "model.safetensors",
    "model.layers.9.self_attn.v_proj.biases": "model.safetensors",
    "model.layers.9.self_attn.v_proj.scales": "model.safetensors",
    "model.layers.9.self_attn.v_proj.weight": "model.safetensors",
    "model.norm.weight": "model.safetensors"
  }
}
Range not satisfiableRange not satisfiable%
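A quick way to look for other files damaged in the same way is to try parsing every JSON file in the cache. This is a minimal sketch, assuming you pass in the cache root yourself; the function name `find_corrupt_json` is hypothetical and not part of exo or huggingface_hub.

```python
import json
from pathlib import Path


def find_corrupt_json(root: Path) -> list[tuple[Path, str]]:
    """Return (path, error) for every *.json file under `root` that fails to parse."""
    bad: list[tuple[Path, str]] = []
    for path in sorted(root.rglob("*.json")):
        try:
            # open() follows symlinks, so snapshot links pointing at blobs are read too
            with path.open("r", encoding="utf-8") as f:
                json.load(f)
        except (ValueError, OSError) as e:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses;
            # OSError covers broken symlinks and unreadable files
            bad.append((path, str(e)))
    return bad
```

For example, `find_corrupt_json(Path.home() / ".cache" / "huggingface" / "hub")` checks the default Hugging Face cache location. A file with text appended after the closing brace, like the one above, is reported with an "Extra data" error. This only covers JSON files; a damaged `.safetensors` file would need a different check.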

With the help of ChatGPT (o1), I tracked it down to a bug in hf_helpers.py:

elif response.status == 416:  # Range not satisfiable
  content_range = response.headers.get('Content-Range', '')
  try:
    total_size = int(content_range.split('/')[-1])
    if downloaded_size == total_size:
      if DEBUG >= 2: print(f"File fully downloaded on first pass: {file_path}")
      ...
      return
    # no return or raise here when downloaded_size != total_size
  except ValueError:
    if DEBUG >= 1: print(f"Failed to parse Content-Range header: {content_range}. Starting download from scratch...")
    return await download_file(..., use_range_request=False)

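As ChatGPT put it: "After that code block, there's no return or raise if downloaded_size != total_size. The function simply 'falls through' to this part." In other words, when the sizes differ, execution continues past the 416 branch, which would explain how the server's "Range not satisfiable" response body ended up appended to the end of the file shown above.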
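To make the fall-through concrete, here is a toy model, not exo's actual code. It assumes that whatever runs after the 416 branch writes the response body to the local file, which is consistent with the "Range not satisfiable" text at the end of the corrupted file. The function name `handle_416_buggy` is hypothetical.

```python
def handle_416_buggy(local: bytes, content_range: str, body: bytes) -> bytes:
    """Toy model of the buggy 416 branch: only the 'sizes match' case returns early.

    `local` is the downloaded file contents, `content_range` is the 416
    response's Content-Range header (e.g. 'bytes */1234'), and `body` is the
    416 response body.
    """
    downloaded_size = len(local)
    total_size = int(content_range.split("/")[-1])
    if downloaded_size == total_size:
        return local  # already complete: leave the file untouched
    # Modelled bug: nothing returns or raises when the sizes differ, so control
    # falls through to the generic "write the response body" step.
    return local + body
```

Running it twice on a file whose size does not match the reported total appends the message twice, which is consistent with the doubled "Range not satisfiable" in the tail output.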

This looks like a bug to me. Here is the fix suggested by ChatGPT o1 (after a few suggestions I made):

diff --git a/exo/download/hf/hf_helpers.py b/exo/download/hf/hf_helpers.py
index 8be12b9..ff3156c 100644
--- a/exo/download/hf/hf_helpers.py
+++ b/exo/download/hf/hf_helpers.py
@@ -202,6 +202,29 @@ async def download_file(
           if progress_callback:
             await progress_callback(RepoFileProgressEvent(repo_id, revision, file_path, downloaded_size, downloaded_this_session, total_size, 0, timedelta(0), "complete"))
           return
+        if downloaded_size > total_size:
+            # Local file is bigger than remote file => definitely corrupted
+            if DEBUG >= 1:
+                print(
+                    f"Local file size ({downloaded_size}) is greater than remote file size ({total_size}). "
+                    "Removing local file and restarting download from scratch..."
+                )
+            # Remove the local file and redownload entirely
+            await aios.remove(local_path)
+            return await download_file(
+                session, repo_id, revision, file_path, save_directory, progress_callback, use_range_request=False
+            )
+
+        if downloaded_size < total_size:
+            # We haven't yet downloaded the full file, so the requested range is invalid
+            if DEBUG >= 1:
+                print(
+                    f"Partial file mismatch: local={downloaded_size}, total={total_size}. "
+                    "Retrying from scratch without range requests..."
+                )
+            return await download_file(
+                session, repo_id, revision, file_path, save_directory, progress_callback, use_range_request=False
+            )
       except ValueError:
         if DEBUG >= 1: print(f"Failed to parse Content-Range header: {content_range}. Starting download from scratch...")
         return await download_file(session, repo_id, revision, file_path, save_directory, progress_callback, use_range_request=False)
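The branching in the patch above can be summarised as a small pure function in which every path returns. This is a sketch, not exo's code: the name `action_for_416` and the `Action` values are hypothetical, and it assumes the server reports the complete length in a `Content-Range: bytes */<length>` header on a 416, as the HTTP semantics spec (RFC 9110) describes.

```python
from enum import Enum


class Action(Enum):
    COMPLETE = "complete"                        # local file already has every byte
    DELETE_AND_RESTART = "delete_and_restart"    # local file is larger than the remote one
    RESTART = "restart"                          # local file is shorter; retry without Range
    RESTART_UNPARSEABLE = "restart_unparseable"  # Content-Range could not be parsed


def action_for_416(downloaded_size: int, content_range: str) -> Action:
    """Decide what to do after a 416 response; no case falls through."""
    try:
        total_size = int(content_range.split("/")[-1])
    except ValueError:
        return Action.RESTART_UNPARSEABLE
    if downloaded_size == total_size:
        return Action.COMPLETE
    if downloaded_size > total_size:
        return Action.DELETE_AND_RESTART
    return Action.RESTART
```

Keeping the decision separate from the network and file I/O makes each branch easy to unit-test.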

Seems to work. Let me know.

NebulaRover77 • Jan 26 '25 20:01

Should be fixed by PR #640 (https://github.com/exo-explore/exo/pull/640). Let me know if it's still an issue.

AlexCheema • Jan 27 '25 19:01

> Should be fixed with #640. Let me know if it's still an issue.

Is it possible that exo no longer respects HF_HOME? It seems to be downloading everything to ~/.cache/exo/downloads/, even though I have set HF_HOME to a different folder.

EDIT: Never mind, I see that HF_HOME has been replaced by EXO_HOME.

EDIT 2: I see you also switched from hub/ to downloads/. Does this mean I shouldn't try to share my Hugging Face directory between exo and other programs that use Hugging Face, or do I just need to set the environment variables up correctly to do this? (A symbolic link would be another option, but I'd prefer to avoid one.)

EDIT 3: Okay, it looks like the directory layout has changed completely. I'm also getting odd behavior when both HF_HOME and EXO_HOME are set, so I'm unsure how to migrate everything without re-downloading all the models.

NebulaRover77 • Jan 27 '25 23:01

Fixed in 1.0.

Evanev7 • Dec 18 '25 18:12