[Feature] vLLM Integration
Is this your first time submitting a feature request?
- [X] I have searched the existing issues, and I could not find an existing issue for this feature
- [X] I am requesting a straightforward extension of existing functionality
Describe the feature
It would be incredible if we could run canopy locally with Mixtral 8x7B, which (afaik) would need GGUF-quantized Mixtral via vLLM. This would also open us up to integrations with things like formal grammars (which, again afaik, need local models; I don't think any API solutions accept them).
Holiday season is just around the corner, and I'm not sure if you guys got me anything, so I'm just putting this out there as an idea.
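As a rough, hedged sketch of what the requested serving path could look like, vLLM's offline Python API is shown below. The model name is the unquantized Hugging Face repo and needs far more memory than the GGUF build linked later; whether GGUF-quantized Mixtral can be loaded this way depends on vLLM's quantization support, so treat this as illustrative only. vLLM also ships an OpenAI-compatible HTTP server, which may be the more natural integration point for canopy's existing OpenAI-style LLM class.

```python
# Illustrative only: vLLM's offline API with an assumed (unquantized) Mixtral repo.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Summarize what canopy does in one sentence."], params)
print(outputs[0].outputs[0].text)
```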
Describe alternatives you've considered
No response
Who will this benefit?
The world, but primarily open LLM devs. It would probably see less production use, but I'm sure having this, and being able to run for free (Pinecone free tier + local LLM), would get more devs building with canopy, imo.
Are you interested in contributing this feature?
Maybe, yes.
Anything else?
Requires around 30 GB of memory using GGUF-quantized Mixtral (https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF), which would fit on Mac M1/M2 chips.
This would also get me to adopt canopy.
I have set my `base_url` and `api_key` to a local instance of an LLM, but I am not able to continue, even though the LLM uses the OpenAI module.
I get a `TypeError: 'NoneType' object is not subscriptable` error; however, when I use the real base URL and API key, it works just fine.
Would love to see this adopted.
@rachfop can you please elaborate or provide repro code?
For any model that follows the OpenAI APIs, this should actually work out of the box.
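For reference, a minimal canopy-independent check of a local OpenAI-compatible endpoint might look like the sketch below (the URL, model name, and key are placeholders; the v1 openai client can also pick them up from the `OPENAI_BASE_URL` / `OPENAI_API_KEY` environment variables). If the local server returns a response that is missing fields the client expects, downstream code indexing into it can surface exactly the `'NoneType' object is not subscriptable` error described above.

```python
from openai import OpenAI

# Placeholders: point at whichever local OpenAI-compatible server is running;
# most local servers ignore the key, but the client requires some value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

resp = client.chat.completions.create(
    model="local-model",  # placeholder: whatever model the local server exposes
    messages=[{"role": "user", "content": "ping"}],
)

# If the server is only partially OpenAI-compatible, `choices` or fields inside it
# can come back empty or null, which then breaks code that subscripts into them.
print(resp.choices[0].message.content)
```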
Totally think this should be considered! For two reasons:
- One shouldn't just rely on paid models when local, fine-tuned models can perform just as well or better.
- OpenAI embeddings are not great, which can affect the accuracy of your RAG setup (https://huggingface.co/spaces/mteb/leaderboard); see the sketch below for swapping in a local embedding model.
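As a hedged illustration of the second point, a local embedding model from the MTEB leaderboard can be swapped in with a few lines via sentence-transformers (the model name below is just one example that ranks well there):

```python
from sentence_transformers import SentenceTransformer

# One example of a strong open embedding model from the MTEB leaderboard.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

embeddings = model.encode(["What is canopy?", "Canopy is a RAG framework."])
print(embeddings.shape)  # (2, 768) for this model
```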
Also, perhaps using Ollama would be good, as it creates a local server for the LLM.
Thoughts?
Ollama exposes an OpenAI-compatible API.
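So the same client pattern from above should apply against Ollama's local endpoint; a hedged sketch (default port, and it assumes the `mixtral` model has already been pulled):

```python
from openai import OpenAI

# Ollama's default local endpoint; the key is required by the client but ignored.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="mixtral",  # assumes `ollama pull mixtral` has been run beforehand
    messages=[{"role": "user", "content": "Hello from a local model"}],
)
print(resp.choices[0].message.content)
```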