
[BOUNTY - $100] Support running any model from huggingface

Open • AlexCheema opened this issue 1 year ago • 12 comments

Like this: https://x.com/reach_vb/status/1846545312548360319

exo run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

This should work out of the box with #139
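
For illustration only, here is a rough sketch of how a spec like the one above could be parsed into a repo id plus quantization tag. `parse_model_spec` is a hypothetical helper, not existing exo code:

```python
# Hypothetical sketch (not exo's actual CLI code): split a spec like
# "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0" into a Hugging Face
# repo id and an optional quantization tag.
def parse_model_spec(spec: str) -> tuple[str, str | None]:
    # Strip an optional "hf.co/" or "huggingface.co/" prefix.
    for prefix in ("hf.co/", "huggingface.co/"):
        if spec.startswith(prefix):
            spec = spec[len(prefix):]
            break
    # An optional ":QUANT" suffix selects a quantization (e.g. Q8_0).
    repo_id, _, quant = spec.partition(":")
    return repo_id, (quant or None)


print(parse_model_spec("hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0"))
# -> ('bartowski/Llama-3.2-1B-Instruct-GGUF', 'Q8_0')
```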

AlexCheema avatar Oct 16 '24 19:10 AlexCheema

Hugging Face Transformers can run GGUF files, but it first dequantizes them to fp32, defeating the purpose altogether. We could run these directly on llama.cpp instead of using the hf/torch inference engine, but I'm not quite sure about that yet.
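
For reference, this is roughly what the Transformers GGUF path looks like. A minimal sketch only; the exact GGUF filename is an assumption, and the key point is that the weights are dequantized to a full-precision torch dtype on load:

```python
# Minimal sketch of loading a GGUF checkpoint through Hugging Face Transformers
# via the gguf_file argument. Transformers dequantizes the GGUF tensors on load,
# so you pay the full-precision memory cost and lose the benefit of Q8_0.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bartowski/Llama-3.2-1B-Instruct-GGUF"
gguf_file = "Llama-3.2-1B-Instruct-Q8_0.gguf"  # filename assumed; check the repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```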

PS: #335 is still WIP, but we can probably base this feature on it. I can work on accelerating progress as far as llama.cpp inference is concerned.

komikat avatar Oct 16 '24 23:10 komikat

@AlexCheema I would like to work on this. Please assign it to me

AReid987 avatar Oct 17 '24 02:10 AReid987

I assigned you both, @komikat @AReid987. You will both receive the bounty for any meaningful work towards this - feel free to work independently or together, up to you.

AlexCheema avatar Oct 17 '24 04:10 AlexCheema

Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method is to dequantise it, which is suboptimal.
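
As a rough sketch only, invoking that tool from Python might look like the following. The binary name and flags vary between llama.cpp versions (older builds ship it as `gguf-split`, newer ones as `llama-gguf-split`), so treat the arguments as an approximation:

```python
# Sketch: split a GGUF file into shards using llama.cpp's gguf-split tool.
# Flags and binary name are assumptions that depend on the llama.cpp version.
import subprocess

subprocess.run(
    [
        "llama-gguf-split",
        "--split",
        "--split-max-size", "2G",           # or --split-max-tensors N
        "Llama-3.2-1B-Instruct-Q8_0.gguf",  # input file (name assumed)
        "Llama-3.2-1B-Instruct-Q8_0",       # output prefix for the shards
    ],
    check=True,
)
```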

komikat avatar Oct 17 '24 09:10 komikat

> Hi @AlexCheema, llama.cpp seems to natively support sharding via gguf-split. Could we just use that to shard the downloaded GGUF and run it on the connected nodes? I also feel we will need to do this on llama.cpp, considering the Hugging Face method is to dequantise it, which is suboptimal.

exo supports multiple inference backends through the InferenceEngine interface. It's not enough to support just llama.cpp.
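
For context, here is a simplified, illustrative sketch of that kind of interface. Method names and signatures are approximations, not exo's actual definitions:

```python
# Illustrative sketch only - not exo's actual InferenceEngine code. The point is
# that each backend (MLX, torch/tinygrad, and potentially llama.cpp) implements
# the same interface, so GGUF support shouldn't be tied to a single backend.
from abc import ABC, abstractmethod

import numpy as np


class InferenceEngine(ABC):
    @abstractmethod
    async def infer_prompt(self, shard, prompt: str) -> np.ndarray:
        """Run a prompt through this node's shard of the model."""

    @abstractmethod
    async def infer_tensor(self, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from activations produced by another node."""


class LlamaCppInferenceEngine(InferenceEngine):
    """Hypothetical llama.cpp-backed engine: it would slot in alongside the
    existing engines by implementing the same methods (stub only)."""
```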

AlexCheema avatar Oct 17 '24 19:10 AlexCheema

I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

komikat avatar Oct 17 '24 19:10 komikat

> I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

Sure, let's start with that.

AlexCheema avatar Oct 17 '24 19:10 AlexCheema

> I'm not sure if there is a way to run .gguf files on PyTorch. Hugging Face can do it, but it would have to be dequantised. Since there already is a Hugging Face inference engine, I'd base this feature on that until llama.cpp inference comes around. How does this sound?

I'm using this library to parse the GGUF files; it takes the raw byte tensors and converts them to numpy arrays. If you intend to load the weights into PyTorch, you could just convert the numpy arrays to torch tensors.
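
As one possible sketch, the `gguf` Python package that ships with llama.cpp (which may or may not be the library referred to above) exposes tensors as numpy arrays; unquantized F32/F16 tensors can be handed to torch directly, while quantized tensors come back as raw block data that still needs dequantizing:

```python
# Sketch: read tensors from a GGUF file with the `gguf` package and convert the
# unquantized ones to torch tensors. File name is assumed for illustration.
import numpy as np
import torch
from gguf import GGUFReader

reader = GGUFReader("Llama-3.2-1B-Instruct-Q8_0.gguf")
for tensor in reader.tensors:
    data = np.asarray(tensor.data)
    if data.dtype in (np.float32, np.float16):
        weights = torch.from_numpy(data.copy())  # copy off the memory map
        print(tensor.name, tuple(weights.shape))
    else:
        # Quantized (e.g. Q8_0) tensors are raw blocks, not usable values yet.
        print(tensor.name, "quantized blocks:", data.dtype, data.shape)
```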

bayedieng avatar Oct 21 '24 17:10 bayedieng

@bayedieng there's also a llama.cpp-to-torch converter.

komikat avatar Oct 24 '24 08:10 komikat

MLX has documentation on using GGUF files for generation; I will integrate this into exo for now.
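
A minimal sketch of that path, based on the gguf_llm example in mlx-examples: `mx.load` can read a .gguf file directly and return the weights plus the GGUF metadata, though MLX only supports a subset of GGUF quantization types (file name assumed):

```python
# Sketch: load a GGUF file directly with MLX and inspect its contents.
import mlx.core as mx

weights, metadata = mx.load(
    "Llama-3.2-1B-Instruct-Q8_0.gguf", return_metadata=True
)
print(metadata.get("general.architecture"))
for name, array in list(weights.items())[:5]:
    print(name, array.shape, array.dtype)
```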

komikat avatar Oct 24 '24 12:10 komikat

My apologies. I was sidetracked by an intense build at a startup I was contracting for. I've finished up now and am ready to work on this or anything else pressing. @AlexCheema @komikat

AReid987 avatar Dec 19 '24 11:12 AReid987

Hey @AlexCheema, I have solved this. Currently tested with SmolLM (a Llama-architecture model).

Dead-Bytes avatar Mar 09 '25 06:03 Dead-Bytes

We're closing bounties. Thank you to all who contributed, and apologies that we left this unmanaged for so long.

Evanev7 avatar Dec 18 '25 16:12 Evanev7