[BOUNTY - $100] Support running any model from Hugging Face
Like this: https://x.com/reach_vb/status/1846545312548360319
exo run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0
This should work out of the box with #139
Hugging Face transformers can run GGUF files, but it first dequantizes them to fp32, defeating the purpose altogether. We could run these directly on llama.cpp instead of using the hf/torch inference engine, but I'm not quite sure about that yet.
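For context, this is roughly what loading a GGUF checkpoint through transformers looks like (a minimal sketch; the exact .gguf filename inside the repo is an assumption, and transformers dequantizes the weights on load, which is the drawback above):

```python
# Sketch: loading a GGUF checkpoint via Hugging Face transformers.
# Caveat: transformers dequantizes the GGUF weights on load, so the
# Q8_0 quantization is not kept in memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bartowski/Llama-3.2-1B-Instruct-GGUF"
gguf_file = "Llama-3.2-1B-Instruct-Q8_0.gguf"  # assumed filename inside the repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```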
PS: #335 is still WIP, but we can probably base this feature on it. I can work on accelerating its progress as far as llama.cpp inference is concerned.
@AlexCheema I would like to work on this. Please assign it to me
I assigned you both, @komikat @AReid987. You will both receive the bounty for any meaningful work towards this. Feel free to work independently or together - up to you.
Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face route dequantizes the weights, which is suboptimal. (Rough sketch below.)
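For reference, something like this is what I mean; the binary name and flags follow llama.cpp's gguf-split tool but may differ by version, and the filenames are placeholders:

```python
# Sketch: splitting a downloaded GGUF into shards with llama.cpp's gguf-split.
# Treat the binary name ("llama-gguf-split" vs "gguf-split") and the exact
# flag names as assumptions that depend on the llama.cpp version in use.
import subprocess

subprocess.run(
    [
        "llama-gguf-split",             # may be "gguf-split" in older builds
        "--split",
        "--split-max-tensors", "128",   # tensors per shard
        "Llama-3.2-1B-Instruct-Q8_0.gguf",    # input file (placeholder)
        "Llama-3.2-1B-Instruct-Q8_0-shard",   # output prefix (placeholder)
    ],
    check=True,
)
```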
exo supports multiple inference backends through the InferenceEngine interface, so it's not enough to support just llama.cpp.
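To illustrate the shape of that (signatures here are hypothetical, not exo's actual interface), a GGUF-capable backend would plug in behind an abstract interface along these lines:

```python
# Illustrative only -- NOT exo's real InferenceEngine signatures. The point is
# that any GGUF backend (llama.cpp, MLX, torch, ...) sits behind one interface.
from abc import ABC, abstractmethod
import numpy as np


class InferenceEngine(ABC):
    """Hypothetical minimal backend interface."""

    @abstractmethod
    async def infer_prompt(self, request_id: str, shard, prompt: str) -> np.ndarray:
        """Run this node's model shard on a prompt and return its output."""

    @abstractmethod
    async def infer_tensor(self, request_id: str, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from an intermediate activation (sharded runs)."""
```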
I'm not sure if there is a way to run .gguf files in PyTorch. Hugging Face can do it, but the weights would have to be dequantized. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?
Sure, let's start with that.
I'm using this library to parse the GGUF files; it takes the raw byte tensors and converts them to numpy arrays. If you intend to load the weights into PyTorch, you could just convert the numpy arrays to torch tensors.
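As an illustration (the library isn't named above; this sketch assumes the gguf-py reader that ships with llama.cpp, which may not be the one actually used):

```python
# Sketch: read GGUF tensors as numpy arrays and wrap them in torch tensors.
# Assumes the gguf-py package from the llama.cpp repo. Quantized tensors come
# out as raw quantized blocks, so direct conversion only makes sense for
# F32/F16 tensors unless you dequantize first.
import numpy as np
import torch
from gguf import GGUFReader

reader = GGUFReader("Llama-3.2-1B-Instruct-Q8_0.gguf")  # placeholder path

weights = {}
for tensor in reader.tensors:
    weights[tensor.name] = torch.from_numpy(np.array(tensor.data))  # copy out of the mmap

print(len(weights), "tensors loaded")
```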
@bayedieng there's also a llama.cpp-to-torch converter.
MLX has documentation on using GGUF files for generation; I'll integrate this into exo for now.
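For reference, MLX can load GGUF weights directly via mx.load (a minimal sketch; the filename is a placeholder, and the metadata key follows the GGUF spec):

```python
# Sketch: loading GGUF weights and metadata with MLX's GGUF support.
# The mlx-examples gguf_llm generation example builds on this kind of loading.
import mlx.core as mx

weights, metadata = mx.load("Llama-3.2-1B-Instruct-Q8_0.gguf", return_metadata=True)

print(list(weights)[:5])                     # a few tensor names
print(metadata.get("general.architecture"))  # e.g. "llama"
```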
My apologies. I was sidetracked by a dense build at a startup I was contracting for. I've finished up now and am ready to work on this or anything else pressing. @AlexCheema @komikat
Hey @AlexCheema, I have solved this. It's currently tested with the SmolLM Llama model.
We're closing bounties. Thank you to all who contributed, and apologies that we left this unmanaged for so long.