CTranslate2
Fast inference engine for Transformer models
As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to include model parallelism to split a model's layers across multiple GPUs or GPU+CPU to meet the...
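CTranslate2 does not ship this today, but the core of the request is straightforward to sketch: partition a stack of layers into contiguous stages, one per device. A minimal device-agnostic sketch of that partitioning step (the device names are placeholders, not CTranslate2 API):

```python
def partition_layers(layers, devices):
    # Split a stack of layers into contiguous stages, one per device,
    # as evenly as possible -- the layout step of naive pipeline/model
    # parallelism. Activations would then flow stage to stage at runtime.
    n, k = len(layers), len(devices)
    stages, start = [], 0
    for i, dev in enumerate(devices):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        stages.append((dev, layers[start:start + size]))
        start += size
    return stages

# e.g. a 10-layer decoder over three GPUs and the CPU:
stages = partition_layers(list(range(10)), ["gpu0", "gpu1", "gpu2", "cpu"])
```

The hard part in practice is not the split but moving activations (and the KV cache) between devices without stalling, which is why the feature is non-trivial for an inference engine.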
Hi, currently the decoder produces sentence-level scores; instead of just outputting the average, another option would be to produce the score of each word/token. Beam search might be a harder...
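To make the request concrete: the sentence-level score is typically the average of per-token log-probabilities, so returning the per-token values alongside the average costs nothing extra. A small illustration (the log-prob values are hypothetical):

```python
def sentence_score(token_log_probs):
    # Sentence-level score as the average per-token log-probability,
    # which is what the decoder currently reports.
    return sum(token_log_probs) / len(token_log_probs)

# Hypothetical per-token log-probs for a 4-token hypothesis.
token_log_probs = [-0.1, -2.3, -0.05, -0.4]

# Exposing the per-token list lets a caller spot the low-confidence
# second token, which the average alone hides.
avg = sentence_score(token_log_probs)
worst = min(token_log_probs)
```

With beam search the bookkeeping is indeed harder, since each surviving hypothesis carries its own token-score history across pruning steps.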
Hello. So, I want to run the NLLB-200 (3.3B) model on a server with 4x 3090 and, say, a 16-core AMD Epyc CPU. I wrapped CTranslate2 in FastAPI, running with...
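For this setup, CTranslate2's `Translator` already supports data parallelism (one model replica per GPU) via `device_index`, which pairs well with a FastAPI worker pool. A minimal sketch, assuming a locally converted NLLB model directory (the path is a placeholder):

```python
def make_translator(model_dir):
    # Imported lazily so the helper below stays usable without the library.
    import ctranslate2

    # Listing several GPUs in device_index creates one replica per device;
    # incoming batches are dispatched across them (data parallelism, not
    # model parallelism -- each 3090 must hold the full 3.3B model).
    return ctranslate2.Translator(
        model_dir,
        device="cuda",
        device_index=[0, 1, 2, 3],
        inter_threads=4,  # allow 4 batches in flight, one per replica
    )

def nllb_target_prefix(lang_code, batch_size):
    # NLLB models expect the target language token (e.g. "fra_Latn") as
    # the first decoder token of every hypothesis, passed as target_prefix
    # to translate_batch.
    return [[lang_code]] * batch_size
```

Usage would be along the lines of `translator.translate_batch(tokens, target_prefix=nllb_target_prefix("fra_Latn", len(tokens)))` after tokenizing with the NLLB SentencePiece model.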
Hello Authors, I apologise for asking a question unrelated to an issue with the repo; however, would you consider supporting a newer paradigm I came across whilst reading a recent [paper](https://www.researchgate.net/publication/367557918_Understanding_INT4_Quantization_for_Transformer_Models_Latency_Speedup_Composability_and_Failure_Cases)?...
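The paradigm in question is INT4 weight quantization. The core mechanic is the same as the existing INT8 path but with a [-8, 7] range, which is why the paper frames it as a latency/accuracy trade-off. A minimal symmetric per-tensor sketch of the idea (not CTranslate2's actual kernels):

```python
import numpy as np

def quantize_int4(w):
    # Symmetric quantization to the signed 4-bit range [-8, 7]:
    # the scale maps the largest magnitude onto 7, then values are
    # rounded and clipped. Stored here in int8 for simplicity; a real
    # kernel would pack two values per byte.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale
```

With only 16 representable levels, the rounding error per weight is bounded by half a scale step, which is why INT4 usually needs per-group scales (rather than per-tensor, as above) to stay accurate on large models.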
Related to #1349.
Hi, I want to use this lib to get encodings (at all positions) from the flan-T5 encoder on CPU. But I am not familiar with C++, so it is hard...
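No C++ should be needed for this: CTranslate2's Python API exposes an `Encoder` class whose `forward_batch` returns per-position hidden states. Whether flan-T5's encoder can be converted to that encoder-only format is an assumption to check against the converter's supported architectures; the model path below is a placeholder:

```python
def t5_encoder_tokens(sp_tokens):
    # T5 expects the end-of-sequence token appended to the encoder input.
    return sp_tokens + ["</s>"]

def encode_on_cpu(model_dir, token_batches):
    # Imported lazily so the token helper above works without the library.
    import ctranslate2

    # Encoder/forward_batch is the documented entry point for encoder-only
    # models; running flan-T5's encoder through it is an assumption here.
    encoder = ctranslate2.Encoder(model_dir, device="cpu")
    output = encoder.forward_batch(token_batches)
    # last_hidden_state holds the encoding at every input position.
    return output.last_hidden_state
```

The returned storage can be viewed as a NumPy array on CPU, giving the per-position flan-T5 encodings without touching C++.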
Context: With HF models, one can use [peft](https://github.com/huggingface/peft) to do parameter-efficient tuning, the most popular (and AFAIK most performant) method being LoRA. Idea: It would be great to be...
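For context on why this fits an inference engine: LoRA trains only a low-rank update B·A on top of a frozen weight W, and for inference that update can be merged back into W, so a merged checkpoint needs no special runtime support. A minimal NumPy sketch of both forms (names and shapes are illustrative, not the peft API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    # Base projection plus the low-rank update:
    #   y = x W^T + (alpha / r) * x A^T B^T
    # Only A (r x in) and B (out x r) are trained; W (out x in) is frozen,
    # which is what makes the method parameter-efficient.
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def merge_lora(W, A, B, alpha):
    # Fold the update into the frozen weight for inference:
    #   W' = W + (alpha / r) * B A
    # A merged W' behaves exactly like the adapted model, so it can be
    # converted (and quantized) like any ordinary checkpoint.
    r = A.shape[0]
    return W + (alpha / r) * B @ A
```

Serving many adapters over one shared base model without merging is the harder feature, since the B·A product must then be applied per request at runtime.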