Support for Async Embeddings via michaelfeil/infinity
Describe the Feature
I would like to integrate https://github.com/michaelfeil/infinity for embeddings inference. It automatically batches up concurrent requests, uses FlashAttention-2, and is compatible with CUDA, ROCm, Apple MPS, and CPU. Depending on usage, you can expect a 2.5x-22x throughput improvement/speedup over the default LangChain HF embeddings code.
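For context, a minimal sketch of the async usage Infinity exposes (names taken from the project's README at the time; exact constructor arguments may differ between versions, so treat this as illustrative):

```python
import asyncio

from infinity_emb import AsyncEmbeddingEngine

# model name and constructor arguments follow the Infinity README; exact
# signatures may vary across versions, so treat this as a sketch
engine = AsyncEmbeddingEngine(model_name_or_path="BAAI/bge-small-en-v1.5")

async def main() -> None:
    # entering the context loads the model; concurrent calls to embed()
    # are coalesced into batches internally
    async with engine:
        embeddings, usage = await engine.embed(sentences=["Paris is in France."])
        print(len(embeddings[0]), usage)

asyncio.run(main())
```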
Hey @michaelfeil, thank you for sharing this, man. We would love to have this. Are you interested in working on it?
@shahules786 Perhaps I'll just push the pure Python / async implementation into LangChain directly; then it should be reusable, right?
https://github.com/explodinggradients/ragas/blob/41c0c286ae3632a77db13ddd265b7699fe6a4adc/src/ragas/embeddings/base.py#L45C1-L78C61
Hey @michaelfeil, this would be awesome. Like you said, if you drop something into LangChain, that will be the easiest for you in terms of time spent. What we would love to do is build an integration doc with Infinity that showcases how fast it is and how it improves things for people using Ragas as well, hopefully driving some traffic your way.
If you check this section, we embed a lot of chunks in sequence, which is bottlenecked by how your embedding model is served. Maybe we can do a comparison here? Would that be something you're interested in?
https://github.com/explodinggradients/ragas/blob/27e1c24e4b53f8f873aa4a15db90f4ee3125c805/src/ragas/testset/docstore.py#L229-L250
We can do other comparisons too, but the LLM is the limiting factor for performance there, so there won't be much of a difference. The above use case would be solid for a comparison.
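A rough sketch of what that comparison harness could look like (hypothetical helpers; `embeddings` stands for any LangChain Embeddings implementation and `chunks` for a list of document texts from the docstore):

```python
import asyncio
import time

async def embed_sequentially(embeddings, chunks):
    # one request at a time, mirroring the current docstore loop
    return [await embeddings.aembed_query(chunk) for chunk in chunks]

async def embed_concurrently(embeddings, chunks):
    # fired concurrently, so a batching server like Infinity can coalesce
    # the in-flight requests into large batches
    return await asyncio.gather(*(embeddings.aembed_query(c) for c in chunks))

async def compare(embeddings, chunks):
    for fn in (embed_sequentially, embed_concurrently):
        start = time.perf_counter()
        await fn(embeddings, chunks)
        print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```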
Let me know if it's something that interests you :)
Would be interesting. FYI, I added the PR for LangChain here; it took me some hours over the weekend, hope it gets merged soon. https://github.com/langchain-ai/langchain/pull/17671
I would not recommend submitting the nodes (assuming each node has 1 sentence) one at a time with ThreadPoolExecutor. At a minimum, batch the requests; this will help whatever backend you use, even APIs.
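Roughly what I mean (the helper names are hypothetical; `embed_documents` is the standard LangChain batch entry point, so each call carries a whole batch instead of one sentence per thread):

```python
from typing import Iterator, List

from langchain_core.embeddings import Embeddings

def batched(texts: List[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield fixed-size batches so the backend sees one request per batch."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

def embed_in_batches(embeddings: Embeddings, node_texts: List[str]) -> List[List[float]]:
    # instead of submitting one node per executor task, send whole batches
    vectors: List[List[float]] = []
    for batch in batched(node_texts):
        vectors.extend(embeddings.embed_documents(batch))
    return vectors
```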
Also, is using async def an option for the function you linked above, @jjmachan?
FYI, everything is now finally in langchain-community (see the PR mentioned above). Also, you might be interested in https://github.com/michaelfeil/infinity/blob/1fe3a34e295c95fc4a75297de842ec55c6761457/docs/benchmarks/benchmarking.md for benchmarking.
@jjmachan It should now be in some versions of langchain.
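In case it helps the integration doc, usage via langchain-community looks roughly like this (a sketch from memory of the PR above; the exact class name and arguments should be checked against the current docs):

```python
import asyncio

from langchain_community.embeddings import InfinityEmbeddingsLocal

# class name and arguments as I recall them from the PR; verify against
# the langchain-community docs before relying on this
embeddings = InfinityEmbeddingsLocal(
    model="BAAI/bge-small-en-v1.5",
    device="cpu",
)

async def main() -> None:
    # the model stays loaded for the duration of the async context
    async with embeddings:
        vectors = await embeddings.aembed_documents(
            ["Ragas evaluates RAG pipelines."]
        )
        print(len(vectors[0]))

asyncio.run(main())
```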
Hey all, looking forward to contributing this.
Nah, not stale!
I am still waiting for a freaking PR review
hey @michaelfeil - extremely sorry about this 🙁 - reopening this now and reviewing your PR right now
we have been a bit slow the last couple of months, which is why this slipped through the cracks - again, extremely sorry for this