
Support for Async Embeddings via michaelfeil/infinity

Open michaelfeil opened this issue 1 year ago • 10 comments

Describe the Feature
I would like to integrate https://github.com/michaelfeil/infinity for embeddings inference. It automatically batches up concurrent requests, uses flash-attention-2, and is compatible with CUDA, ROCm, Apple MPS, and CPU. Depending on usage, you can expect a 2.5x-22x throughput improvement / speedup over the default HF embeddings code in langchain.
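For context, a minimal sketch of the access pattern, assuming a local infinity server exposing its OpenAI-compatible /embeddings route; the URL, port, model name, and response shape here are assumptions, not pinned to a specific infinity version:

```python
import asyncio

import httpx

# Assumed: a local infinity server with an OpenAI-compatible /embeddings
# route; the port and model name are placeholders.
INFINITY_URL = "http://localhost:7997/embeddings"
MODEL = "BAAI/bge-small-en-v1.5"


async def embed_batch(texts: list[str]) -> list[list[float]]:
    # One HTTP request carries the whole batch; the server additionally
    # micro-batches concurrent requests from other callers on the GPU.
    async with httpx.AsyncClient() as client:
        resp = await client.post(INFINITY_URL, json={"model": MODEL, "input": texts})
        resp.raise_for_status()
        return [row["embedding"] for row in resp.json()["data"]]


async def main() -> None:
    # Concurrent callers get batched together server-side, so throughput
    # scales with load instead of serializing request by request.
    vecs_a, vecs_b = await asyncio.gather(
        embed_batch(["first chunk", "second chunk"]),
        embed_batch(["third chunk"]),
    )
    print(len(vecs_a), len(vecs_b))


asyncio.run(main())
```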

michaelfeil avatar Feb 13 '24 08:02 michaelfeil

Hey @michaelfeil, thank you for sharing this, man. We would love to have this. Are you interested in working on it?

shahules786 avatar Feb 13 '24 17:02 shahules786

@shahules786 Perhaps I'll just push the pure Python / async version into langchain directly; then it should be reusable, right?

https://github.com/explodinggradients/ragas/blob/41c0c286ae3632a77db13ddd265b7699fe6a4adc/src/ragas/embeddings/base.py#L45C1-L78C61
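In that spirit, a minimal sketch of what a langchain-compatible wrapper could look like; anything implementing langchain's Embeddings interface (sync plus async) becomes reusable by every downstream consumer, ragas included. The class name, URL, and model below are placeholders, not the integration that later shipped:

```python
from typing import List

import httpx
from langchain_core.embeddings import Embeddings


class InfinityLikeEmbeddings(Embeddings):
    """Hypothetical wrapper over an OpenAI-compatible embeddings route."""

    def __init__(self, url: str = "http://localhost:7997/embeddings",
                 model: str = "BAAI/bge-small-en-v1.5") -> None:
        self.url = url
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Sync path: one batched request for all texts.
        resp = httpx.post(self.url, json={"model": self.model, "input": texts})
        resp.raise_for_status()
        return [row["embedding"] for row in resp.json()["data"]]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        # Async path: downstream callers can gather many of these calls.
        async with httpx.AsyncClient() as client:
            resp = await client.post(self.url, json={"model": self.model, "input": texts})
            resp.raise_for_status()
            return [row["embedding"] for row in resp.json()["data"]]

    async def aembed_query(self, text: str) -> List[float]:
        return (await self.aembed_documents([text]))[0]
```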

michaelfeil avatar Feb 13 '24 18:02 michaelfeil

hey @michaelfeil, this would be awesome. Like you said, dropping something into langchain directly would be the easiest for you in terms of time spent. What we would love to do is build an integration doc with infinity that showcases how fast it is and how it helps people who are using Ragas as well, hopefully driving some traffic your way.

if you check this section, you'll see we embed a lot of chunks in sequence, so throughput is limited by how your embeddings are served. Maybe we can do a comparison here? Would that be something you're interested in?

https://github.com/explodinggradients/ragas/blob/27e1c24e4b53f8f873aa4a15db90f4ee3125c805/src/ragas/testset/docstore.py#L229-L250

we can do other comparisons too, but the LLM is the limiting factor for performance there, so there won't be much of a difference; the above use case would be solid for a comparison (a toy version is sketched below)
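A self-contained toy comparison of the two access patterns; the embedding call is simulated with a sleep (fixed per-request overhead plus a small per-item cost), so the numbers illustrate the shape of the win, not real infinity throughput:

```python
import asyncio
import time


async def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stand-in for a served embedding model: fixed per-request overhead
    # plus a per-item cost, roughly how batched backends behave.
    await asyncio.sleep(0.05 + 0.001 * len(texts))
    return [[0.0] * 8 for _ in texts]


TEXTS = [f"node {i}" for i in range(256)]


async def sequential() -> None:
    # One request per node, awaited one at a time (the docstore-style loop).
    for t in TEXTS:
        await embed_batch([t])


async def batched_concurrent(batch_size: int = 32) -> None:
    # Group nodes into batches and fire the batches concurrently.
    batches = [TEXTS[i:i + batch_size] for i in range(0, len(TEXTS), batch_size)]
    await asyncio.gather(*(embed_batch(b) for b in batches))


async def main() -> None:
    for fn in (sequential, batched_concurrent):
        t0 = time.perf_counter()
        await fn()
        print(f"{fn.__name__}: {time.perf_counter() - t0:.2f}s")


asyncio.run(main())
```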

let me know if it's something that interests you :)

jjmachan avatar Feb 15 '24 22:02 jjmachan

That would be interesting. FYI, I added the langchain PR here; it took me some hours over the weekend, and I hope it gets merged soon. https://github.com/langchain-ai/langchain/pull/17671

I would not recommend submitting the nodes (assuming each node has one sentence) one by one with a ThreadPoolExecutor. At a minimum, batch the requests; this will help whatever backend you use, even APIs.

Also, is using async def an option for the function you linked above, @jjmachan? A sketch of what that could look like is below.
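If async is on the table, the path could look roughly like this, assuming the embeddings object exposes a langchain-style aembed_documents coroutine; function and parameter names are illustrative:

```python
import asyncio
from typing import List


async def aembed_nodes(embeddings, node_texts: List[str],
                       batch_size: int = 32,
                       max_in_flight: int = 8) -> List[List[float]]:
    # Cap concurrent requests with a semaphore so the backend stays busy
    # without being flooded.
    sem = asyncio.Semaphore(max_in_flight)

    async def embed_one(batch: List[str]) -> List[List[float]]:
        async with sem:
            # Assumes a langchain-style async API on the embeddings object.
            return await embeddings.aembed_documents(batch)

    batches = [node_texts[i:i + batch_size]
               for i in range(0, len(node_texts), batch_size)]
    results = await asyncio.gather(*(embed_one(b) for b in batches))
    # Flatten per-batch results back into one vector per node.
    return [vec for batch in results for vec in batch]
```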

michaelfeil avatar Feb 20 '24 03:02 michaelfeil

FYI, the whole thing is now finally in langchain (community; see the PR mentioned above). Also, you might be interested in https://github.com/michaelfeil/infinity/blob/1fe3a34e295c95fc4a75297de842ec55c6761457/docs/benchmarks/benchmarking.md for benchmarking.
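For anyone landing here, rough usage of the merged integration might look like this; the import path matches langchain-community at the time, and the parameter names are taken from its docs and may have shifted since:

```python
from langchain_community.embeddings import InfinityEmbeddings

# Assumed: an infinity server on localhost:7997 serving the named model;
# check the langchain-community docs for the current parameter names.
embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)

vectors = embeddings.embed_documents(["hello", "world"])
print(len(vectors), len(vectors[0]))
```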

michaelfeil avatar Feb 22 '24 05:02 michaelfeil

@jjmachan It should now be in some versions of langchain.

michaelfeil avatar Feb 26 '24 06:02 michaelfeil

Hey all, looking forward to contributing this.

michaelfeil avatar Apr 02 '24 04:04 michaelfeil

Nah, not stale!

michaelfeil avatar Jun 01 '24 00:06 michaelfeil

I am still waiting for a freaking PR review

michaelfeil avatar Jun 01 '24 00:06 michaelfeil

hey @michaelfeil - extremely sorry about this 🙁 - reopening this now and reviewing your PR right now

we have been a bit slow the last couple of months, which is why this slipped through the cracks - again, extremely sorry for this

jjmachan avatar Jul 03 '24 04:07 jjmachan