
Download Hugging Face models into Hugging Face cache

Open · vmpuri opened this issue 4 months ago · 5 comments

Currently, we download models to a local directory (~/.torchchat by default). For Hugging Face models, we should download to the Hugging Face cache instead.

As per Hugging Face:

By default, we recommend using the [cache system](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache) to download files from the Hub. You can specify a custom cache location using the cache_dir parameter in [hf_hub_download()](https://huggingface.co/docs/huggingface_hub/v0.25.1/en/package_reference/file_download#huggingface_hub.hf_hub_download) and [snapshot_download()](https://huggingface.co/docs/huggingface_hub/v0.25.1/en/package_reference/file_download#huggingface_hub.snapshot_download), or by setting the [HF_HOME](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hf_home) environment variable.
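Concretely, the hub cache root and per-repo folder names follow a simple scheme. The sketch below is illustrative only (`hf_cache_dir` and `repo_folder_name` are hypothetical helpers, not torchchat or huggingface_hub API), but it matches the directory names visible in the logs further down:

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    # HF_HOME overrides the default ~/.cache/huggingface root; the hub
    # cache itself lives in the "hub" subdirectory.
    root = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(root) / "hub"

def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    # Cached repos are stored as "<type>s--<org>--<name>", e.g.
    # models--meta-llama--Llama-3.2-1B-Instruct
    return f"{repo_type}s--" + repo_id.replace("/", "--")
```

Passing `cache_dir=hf_cache_dir()` to `snapshot_download()` is therefore equivalent to relying on the default, which is why torchchat can simply stop overriding the location.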

This PR also enables hf_transfer, a production-ready Rust library that speeds up downloads from Hugging Face. In my own testing, the speedup was over 2x:

// Before 

python3 torchchat.py download llama3.2-1b 
...
5.63s user 7.31s system 29% cpu 43.139 total

// After
python3 torchchat.py download llama3.2-1b  
...
7.59s user 2.81s system 48% cpu 21.551 total
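hf_transfer is opt-in: huggingface_hub only routes downloads through it when the HF_HUB_ENABLE_HF_TRANSFER environment variable is set before the library is imported (and the hf_transfer package is installed). A minimal sketch of how the download path can enable it (the helper name is illustrative):

```python
import os

def enable_hf_transfer() -> None:
    # Must run before huggingface_hub is imported: the library reads this
    # flag at import time to decide whether to use the Rust hf_transfer
    # backend (installed separately via `pip install hf_transfer`).
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

enable_hf_transfer()
```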

Testing

Download

python3 torchchat.py download llama3.2-1b

Downloading meta-llama/Meta-Llama-3.2-1B-Instruct from Hugging Face to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct
.gitattributes: 100%|████████████████| 1.52k/1.52k [00:00<00:00, 4.16MB/s]
LICENSE.txt: 100%|███████████████████| 7.71k/7.71k [00:00<00:00, 52.1MB/s]
README.md: 100%|█████████████████████| 35.9k/35.9k [00:00<00:00, 3.75MB/s]
USE_POLICY.md: 100%|█████████████████| 6.02k/6.02k [00:00<00:00, 8.77MB/s]
config.json: 100%|███████████████████████| 877/877 [00:00<00:00, 4.47MB/s]
generation_config.json: 100%|████████████| 189/189 [00:00<00:00, 1.21MB/s]
consolidated.00.pth: 100%|██████████▉| 2.47G/2.47G [01:10<00:00, 35.1MB/s]
original/params.json: 100%|██████████████| 220/220 [00:00<00:00, 1.12MB/s]
tokenizer.model: 100%|███████████████| 2.18M/2.18M [00:00<00:00, 7.80MB/s]
special_tokens_map.json: 100%|███████████| 296/296 [00:00<00:00, 1.62MB/s]
tokenizer.json: 100%|████████████████| 9.09M/9.09M [00:00<00:00, 19.8MB/s]
tokenizer_config.json: 100%|█████████| 54.5k/54.5k [00:00<00:00, 51.7MB/s]
Model downloaded to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14
Converting meta-llama/Meta-Llama-3.2-1B-Instruct to torchchat format...
known configs: ['llava-1.5', '13B', '70B', 'CodeLlama-7b-Python-hf', 'Meta-Llama-3.1-70B-Tune', '34B', 'Meta-Llama-3.1-8B', 'stories42M', 'Llama-Guard-3-1B', '30B', 'Meta-Llama-3.1-8B-Tune', 'stories110M', 'Llama-3.2-11B-Vision', 'Meta-Llama-3.2-3B', 'Meta-Llama-3.1-70B', 'Meta-Llama-3.2-1B', '7B', 'stories15M', 'Llama-Guard-3-1B-INT4', 'Mistral-7B', 'Meta-Llama-3-70B', 'Meta-Llama-3-8B']
Model config {'block_size': 131072, 'vocab_size': 128256, 'n_layers': 16, 'n_heads': 32, 'dim': 2048, 'hidden_dim': 8192, 'n_local_heads': 8, 'head_dim': 64, 'rope_base': 500000.0, 'norm_eps': 1e-05, 'multiple_of': 256, 'ffn_dim_multiplier': 1.5, 'use_tiktoken': True, 'max_seq_length': 8192, 'rope_scaling': {'factor': 32.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192}, 'n_stages': 1, 'stage_idx': 0, 'attention_bias': False, 'feed_forward_bias': False}
Symlinking checkpoint to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth.
Done.
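The "Symlinking checkpoint" step above can be sketched roughly as follows. This is a hypothetical helper, not the actual conversion code; it only illustrates exposing the converted weights under a stable `model.pth` name inside the snapshot directory:

```python
from pathlib import Path

def link_checkpoint(snapshot_dir: Path,
                    checkpoint_name: str = "consolidated.00.pth") -> Path:
    # Create (or refresh) a model.pth symlink next to the original
    # checkpoint so later commands can find weights at a fixed filename.
    target = snapshot_dir / "model.pth"
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(snapshot_dir / checkpoint_name)
    return target
```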

Generate

python3 torchchat.py generate llama3.2-1b --prompt "Write a monologue from this opening line: 'Let me tell you what bugs me about human endeavor.'"   

Using checkpoint path: /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth
Using checkpoint path: /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth
Using device=mps 
Loading model...
Time to load model: 0.55 seconds
-----------------------------------------------------------
Write a monologue from this opening line: 'Let me tell you what bugs me about human endeavor.'
Let me tell you what bugs me about human endeavor. It's the capacity to create something, anything, that's not already out there, and to spend the vast majority of their time trying to make it better or more complex. It's the relentless pursuit of perfection, as if the only thing that matters is the product itself, not the journey.


We create for a reason, I suppose. We have emotions, desires, and problems to solve. And we find ways to tinker, to improvise, and to optimize. But the human problem-solving machine is a double-edged sword. On the one hand, it's incredibly resourceful and adaptable. We can solve complex systems, crack open hearts, and even bring forth unprecedented innovations.

But on the other hand, it's also a curse. We're obsessed with making something exactly right. We spend hours, days, even years tweaking and refining, only to have it devolve into something that's almost, but not quite,
2024-10-09:13:31:17,670 INFO     [generate.py:1146] 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 199 tokens                 
Time for inference 1: 5.7620 sec total                 
Time to first token: 0.2066 sec with parallel prefill.                

      Total throughput: 34.7103 tokens/sec, 0.0288 s/token                 
First token throughput: 4.8414 tokens/sec, 0.2066 s/token                 
 Next token throughput: 35.8208 tokens/sec, 0.0279 s/token                     
2024-10-09:13:31:17,671 INFO     [generate.py:1157] 
Bandwidth achieved: 104.03 GB/s
2024-10-09:13:31:17,671 INFO     [generate.py:1161] *** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 34.71                 
Average tokens/sec (first token): 4.84                 
Average tokens/sec (next tokens): 35.82 

Where

python3 torchchat.py where llama3.2-1b

/Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14
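The `where` command can resolve the cached snapshot path without hitting the network: in the hub cache layout, `refs/main` stores the commit hash of the downloaded revision, and the files live under `snapshots/<hash>`. A hypothetical sketch of that resolution (not the actual torchchat code):

```python
from pathlib import Path

def snapshot_path(hub_dir: Path, repo_id: str) -> Path:
    # refs/main holds the commit hash of the revision that was downloaded;
    # the materialized files live under snapshots/<hash>.
    repo_dir = hub_dir / ("models--" + repo_id.replace("/", "--"))
    revision = (repo_dir / "refs" / "main").read_text().strip()
    return repo_dir / "snapshots" / revision
```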

List

python3 torchchat.py list 

Model                                        Aliases                                                    Downloaded 
-------------------------------------------- ---------------------------------------------------------- -----------
meta-llama/llama-2-7b-hf                     llama2-base, llama2-7b                                                
meta-llama/llama-2-7b-chat-hf                llama2, llama2-chat, llama2-7b-chat                                   
meta-llama/llama-2-13b-chat-hf               llama2-13b-chat                                                       
meta-llama/llama-2-70b-chat-hf               llama2-70b-chat                                                       
meta-llama/meta-llama-3-8b                   llama3-base                                                           
meta-llama/meta-llama-3-8b-instruct          llama3, llama3-chat, llama3-instruct                       Yes        
meta-llama/meta-llama-3-70b-instruct         llama3-70b                                                            
meta-llama/meta-llama-3.1-8b                 llama3.1-base                                                         
meta-llama/meta-llama-3.1-8b-instruct        llama3.1, llama3.1-chat, llama3.1-instruct                            
meta-llama/meta-llama-3.1-70b-instruct       llama3.1-70b                                                          
meta-llama/meta-llama-3.1-8b-instruct-tune   llama3.1-tune, llama3.1-chat-tune, llama3.1-instruct-tune             
meta-llama/meta-llama-3.1-70b-instruct-tune  llama3.1-70b-tune                                                     
meta-llama/meta-llama-3.2-1b                 llama3.2-1b-base                                                      
meta-llama/meta-llama-3.2-1b-instruct        llama3.2-1b, llama3.2-1b-chat, llama3.2-1b-instruct        Yes        
meta-llama/llama-guard-3-1b                  llama3-1b-guard, llama3.2-1b-guard                                    
meta-llama/meta-llama-3.2-3b                 llama3.2-3b-base                                                      
meta-llama/meta-llama-3.2-3b-instruct        llama3.2-3b, llama3.2-3b-chat, llama3.2-3b-instruct                   
meta-llama/llama-3.2-11b-vision              llama3.2-11B-base, Llama-3.2-11B-Vision-base                          
meta-llama/llama-3.2-11b-vision-instruct     llama3.2-11B, Llama-3.2-11B-Vision, Llama-3.2-mm                      
meta-llama/codellama-7b-python-hf            codellama, codellama-7b                                               
meta-llama/codellama-34b-python-hf           codellama-34b                                                         
mistralai/mistral-7b-v0.1                    mistral-7b-v01-base                                                   
mistralai/mistral-7b-instruct-v0.1           mistral-7b-v01-instruct                                               
mistralai/mistral-7b-instruct-v0.2           mistral, mistral-7b, mistral-7b-instruct                              
openlm-research/open_llama_7b                open-llama, open-llama-7b                                             
stories15m                                                                                              Yes        
stories42m                                                                                                         
stories110m                                                                                             Yes 
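The Downloaded column above can be derived directly from the cache: a model counts as downloaded once its repo directory holds at least one materialized snapshot. An illustrative sketch (not the actual torchchat code):

```python
from pathlib import Path

def is_downloaded(hub_dir: Path, repo_id: str) -> bool:
    # A repo counts as downloaded when its cache folder contains at least
    # one snapshot revision directory.
    snaps = hub_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    return snaps.is_dir() and any(snaps.iterdir())
```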

Remove

ls ~/.cache/huggingface/hub/
models--meta-llama--Llama-3.2-1B-Instruct
version.txt

python3 torchchat.py remove llama3.2-1b
Removing downloaded model artifacts for llama3.2-1b at /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct...
Done.


ls ~/.cache/huggingface/hub/
version.txt

Remove Again (file not present)

python3 torchchat.py remove llama3.2-1b
Model llama3.2-1b has no downloaded artifacts in /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct.
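Both `remove` outcomes shown above (deletion, and the no-op when nothing is downloaded) can be sketched as follows. This is a hypothetical helper mirroring the observed behavior, not the actual torchchat implementation:

```python
import shutil
from pathlib import Path

def remove_model(hub_dir: Path, repo_id: str, alias: str) -> bool:
    # Delete the cached repo directory if present; otherwise report that
    # there are no downloaded artifacts to remove.
    model_dir = hub_dir / ("models--" + repo_id.replace("/", "--"))
    if not model_dir.exists():
        print(f"Model {alias} has no downloaded artifacts in {model_dir}.")
        return False
    print(f"Removing downloaded model artifacts for {alias} at {model_dir}...")
    shutil.rmtree(model_dir)
    print("Done.")
    return True
```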

vmpuri · Oct 09 '24 20:10