
[BOUNTY - $100] Support running any model from huggingface

Open • AlexCheema opened this issue 1 year ago • 12 comments

Like this: https://x.com/reach_vb/status/1846545312548360319

exo run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

This should work out of the box with #139
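
For illustration only, here is a rough sketch of how a spec like the one above could be parsed into a repo id plus quantization tag. `parse_model_spec` is a hypothetical helper, not existing exo code:

```python
# Hypothetical sketch (not exo's actual CLI code): split a spec like
# "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0" into a Hugging Face
# repo id and an optional quantization tag.
def parse_model_spec(spec: str) -> tuple[str, str | None]:
    # Strip an optional "hf.co/" or "huggingface.co/" prefix.
    for prefix in ("hf.co/", "huggingface.co/"):
        if spec.startswith(prefix):
            spec = spec[len(prefix):]
            break
    # An optional ":QUANT" suffix selects a quantization (e.g. Q8_0).
    repo_id, _, quant = spec.partition(":")
    return repo_id, (quant or None)


print(parse_model_spec("hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0"))
# -> ('bartowski/Llama-3.2-1B-Instruct-GGUF', 'Q8_0')
```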

AlexCheema avatar Oct 16 '24 19:10 AlexCheema

Hugging Face Transformers can run GGUF files, but it first dequantizes them to fp32, defeating the purpose altogether. We could run these directly on llama.cpp instead of using the hf/torch inference engine, but I'm not quite sure about that yet.
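
For reference, this is roughly what the Transformers GGUF path looks like. A minimal sketch only; the exact GGUF filename is an assumption, and the key point is that the weights are dequantized to a full-precision torch dtype on load:

```python
# Minimal sketch of loading a GGUF checkpoint through Hugging Face Transformers
# via the gguf_file argument. Transformers dequantizes the GGUF tensors on load,
# so you pay the full-precision memory cost and lose the benefit of Q8_0.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bartowski/Llama-3.2-1B-Instruct-GGUF"
gguf_file = "Llama-3.2-1B-Instruct-Q8_0.gguf"  # filename assumed; check the repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```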

PS: #335 is still WIP, but we can probably base this feature on it. I can work on accelerating progress as far as llama.cpp inference is concerned.

komikat avatar Oct 16 '24 23:10 komikat

@AlexCheema I would like to work on this. Please assign it to me

AReid987 avatar Oct 17 '24 02:10 AReid987

I assigned you both, @komikat @AReid987. You will both receive the bounty for any meaningful work towards this - feel free to work independently or together, up to you.

AlexCheema avatar Oct 17 '24 04:10 AlexCheema

Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method is to dequantise it, which is suboptimal.
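
As a rough sketch only, invoking that tool from Python might look like the following. The binary name and flags vary between llama.cpp versions (older builds ship it as `gguf-split`, newer ones as `llama-gguf-split`), so treat the arguments as an approximation:

```python
# Sketch: split a GGUF file into shards using llama.cpp's gguf-split tool.
# Flags and binary name are assumptions that depend on the llama.cpp version.
import subprocess

subprocess.run(
    [
        "llama-gguf-split",
        "--split",
        "--split-max-size", "2G",           # or --split-max-tensors N
        "Llama-3.2-1B-Instruct-Q8_0.gguf",  # input file (name assumed)
        "Llama-3.2-1B-Instruct-Q8_0",       # output prefix for the shards
    ],
    check=True,
)
```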

komikat avatar Oct 17 '24 09:10 komikat

> Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method is to dequantise it, which is suboptimal.

exo supports multiple inference backends through the InferenceEngine interface. It's not enough to support just llama.cpp.
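
For context, here is a simplified, illustrative sketch of that kind of interface. Method names and signatures are approximations, not exo's actual definitions:

```python
# Illustrative sketch only - not exo's actual InferenceEngine code. The point is
# that each backend (MLX, torch/tinygrad, and potentially llama.cpp) implements
# the same interface, so GGUF support shouldn't be tied to a single backend.
from abc import ABC, abstractmethod

import numpy as np


class InferenceEngine(ABC):
    @abstractmethod
    async def infer_prompt(self, shard, prompt: str) -> np.ndarray:
        """Run a prompt through this node's shard of the model."""

    @abstractmethod
    async def infer_tensor(self, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from activations produced by another node."""


class LlamaCppInferenceEngine(InferenceEngine):
    """Hypothetical llama.cpp-backed engine: it would slot in alongside the
    existing engines by implementing the same methods (stub only)."""
```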

AlexCheema avatar Oct 17 '24 19:10 AlexCheema

I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

komikat avatar Oct 17 '24 19:10 komikat

> I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

Sure, let's start with that.

AlexCheema avatar Oct 17 '24 19:10 AlexCheema

> I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

I'm using this library to parse the GGUF files; it takes the raw byte tensors and converts them to numpy arrays. If you intend to load the weights into PyTorch, you could just convert the numpy arrays to torch tensors.
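
As one possible sketch, the `gguf` Python package that ships with llama.cpp (which may or may not be the library referred to above) exposes tensors as numpy arrays; unquantized F32/F16 tensors can be handed to torch directly, while quantized tensors come back as raw block data that still needs dequantizing:

```python
# Sketch: read tensors from a GGUF file with the `gguf` package and convert the
# unquantized ones to torch tensors. File name is assumed for illustration.
import numpy as np
import torch
from gguf import GGUFReader

reader = GGUFReader("Llama-3.2-1B-Instruct-Q8_0.gguf")
for tensor in reader.tensors:
    data = np.asarray(tensor.data)
    if data.dtype in (np.float32, np.float16):
        weights = torch.from_numpy(data.copy())  # copy off the memory map
        print(tensor.name, tuple(weights.shape))
    else:
        # Quantized (e.g. Q8_0) tensors are raw blocks, not usable values yet.
        print(tensor.name, "quantized blocks:", data.dtype, data.shape)
```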

bayedieng avatar Oct 21 '24 17:10 bayedieng

@bayedieng there's also a llama.cpp-to-torch converter.

komikat avatar Oct 24 '24 08:10 komikat

MLX has documentation on using GGUF files for generation; I will integrate this into exo for now.
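
A minimal sketch of that path, based on the gguf_llm example in mlx-examples: `mx.load` can read a .gguf file directly and return the weights plus the GGUF metadata, though MLX only supports a subset of GGUF quantization types (file name assumed):

```python
# Sketch: load a GGUF file directly with MLX and inspect its contents.
import mlx.core as mx

weights, metadata = mx.load(
    "Llama-3.2-1B-Instruct-Q8_0.gguf", return_metadata=True
)
print(metadata.get("general.architecture"))
for name, array in list(weights.items())[:5]:
    print(name, array.shape, array.dtype)
```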

komikat avatar Oct 24 '24 12:10 komikat

My apologies. I was sidetracked by an intense build at a startup I was contracting for. I've finished up now and am ready to work on this or anything else pressing. @AlexCheema @komikat

AReid987 avatar Dec 19 '24 11:12 AReid987

Hey @AlexCheema, I have solved this. Currently tested with SmolLM (a Llama-architecture model).

Dead-Bytes avatar Mar 09 '25 06:03 Dead-Bytes

We're closing bounties. Thank you to all who contributed, and apologies that we left this unmanaged for so long.

Evanev7 avatar Dec 18 '25 16:12 Evanev7