infinity
Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
Hoping to add an implementation of 4-bit BERT, potentially in https://github.com/casper-hansen/AutoAWQ/pull/328. Contributions welcome.
Adding doc for quantization / dtype
Hi! Kudos for this project, Michael! It is amazing. We're migrating from a single repo with a RAG and T40, to one repo with a RAG with just CPU...
I wonder if it would make sense to support compressed requests, esp. for /rerank, where the query and document list could be many 1k or 2k chunks of text? The...
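As a rough illustration of what the request above is asking for, here is a minimal client-side sketch that gzip-compresses a JSON rerank payload using only the Python standard library. The field names `query` and `documents` and the sample text are assumptions for illustration; whether the server accepts `Content-Encoding: gzip` request bodies is exactly the open question in this issue.

```python
import gzip
import json

# Hypothetical rerank payload: one query plus many long text chunks.
payload = {
    "query": "What is Infinity?",
    "documents": ["Infinity serves embeddings over a REST API. " * 40] * 20,
}

raw = json.dumps(payload).encode("utf-8")
compressed = gzip.compress(raw)

# Standard HTTP headers a client would send with a gzip-compressed body.
headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

The savings depend heavily on how repetitive the text is; natural-language chunks usually compress well, but the exact ratio has to be measured on real data.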
This is a draft PR, unlikely to get merged. The performance overhead of inter-process communication is too high.
Love the concept behind Infinity! I wonder if you have a video tutorial or PDF about how to use Infinity? That would be great!
commit hash: 296472eefaa93c361f086ea26bd7cd7e3c6e9a3e I tried it on my Linux machine (Ubuntu 22.04 with CUDA 12.3), and it failed. ``` % infinity_emb --device cuda --engine torch 2024-03-03 11:05:28.807 |...
Return the actual number of tokens used after truncation.
Currently allowing up to batch_size=64 as default. This can potentially lead to high memory usage, e.g. for jina-8k bert -> 64x8192. It would be better to adjust dynamically and set...
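One way to picture the dynamic adjustment suggested above is to cap each batch by a padded-token budget rather than by a fixed count. The sketch below is only an illustration of that idea, not Infinity's batching logic; the function name `batches_by_token_budget` and the `max_tokens` parameter are hypothetical.

```python
def batches_by_token_budget(lengths, max_tokens, max_batch_size=64):
    """Greedily group sequence lengths so that the padded size
    (number of sequences * longest sequence) stays within max_tokens."""
    batches, current, longest = [], [], 0
    for n in sorted(lengths):
        new_longest = max(longest, n)
        over_budget = (len(current) + 1) * new_longest > max_tokens
        if current and (over_budget or len(current) >= max_batch_size):
            # Close the current batch and start a new one with this sequence.
            batches.append(current)
            current, new_longest = [], n
        current.append(n)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Eight short sequences and four 8192-token sequences, 16384-token budget.
batches = batches_by_token_budget([8192] * 4 + [128] * 8, max_tokens=16384)
print([len(b) for b in batches])
```

With this budget the short sequences share one batch while the 8192-token sequences are split into pairs, instead of up to 64 of them being padded to 8192 together.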
Please consider adding a parameter to set the number of decimals in the JSON output. This would be beneficial to reduce network bandwidth requirements and the time for parsing the...
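To make the bandwidth argument concrete, the following sketch rounds embedding values before serializing them and compares payload sizes. It uses only the standard library; the helper `round_embedding` and the `embedding` key are hypothetical and do not reflect an existing Infinity parameter.

```python
import json


def round_embedding(embedding, decimals):
    """Round every component to a fixed number of decimal places."""
    return [round(x, decimals) for x in embedding]


embedding = [0.123456789, -0.987654321, 0.5]

full = json.dumps({"embedding": embedding})
short = json.dumps({"embedding": round_embedding(embedding, 4)})

print(len(full), len(short))
```

Rounding trades precision for size, so the acceptable number of decimals depends on how sensitive the downstream similarity search is.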