
Int4 Support

Open fmac2000 opened this issue 1 year ago • 6 comments

Hello Authors,

I apologise for asking a question unrelated to an issue with the repo; however, would you consider supporting a newer paradigm I came across whilst reading a recent paper?

It looks incredibly promising and rather well written, I must say, especially considering the performance achievable at such a low precision. Is anyone on the team able to give this a shot?

fmac2000 avatar Feb 28 '23 13:02 fmac2000

Hello,

Thank you for sharing this paper!

At this time I don't plan on integrating INT4, which would require using CUTLASS to define custom kernels. We are currently using cuBLAS for matrix multiplication.

guillaumekln avatar Mar 09 '23 16:03 guillaumekln
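For readers unfamiliar with why this needs custom kernels: int4 weights are usually stored in small blocks with a per-block scale and two 4-bit values packed per byte, a layout that a cuBLAS GEMM cannot consume directly. The sketch below (hypothetical names, not CTranslate2 or paper code) shows what such a block layout and quantization step could look like.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical Q4-style layout: 32 weights per block, one float scale,
// and 16 bytes holding two 4-bit values each (illustrative only).
constexpr int kBlockSize = 32;

struct Int4Block {
  float scale;                           // per-block dequantization scale
  std::uint8_t packed[kBlockSize / 2];   // two 4-bit values per byte
};

// Quantize one block of 32 floats to signed 4-bit values in [-8, 7],
// stored offset-by-8 so each nibble is an unsigned value in [0, 15].
Int4Block quantize_block(const float* w) {
  float max_abs = 0.f;
  for (int i = 0; i < kBlockSize; ++i)
    max_abs = std::max(max_abs, std::fabs(w[i]));

  Int4Block block;
  block.scale = max_abs / 7.f;
  const float inv_scale = block.scale != 0.f ? 1.f / block.scale : 0.f;

  auto to_nibble = [&](float x) -> std::uint8_t {
    int v = static_cast<int>(std::lround(x * inv_scale));
    v = std::min(std::max(v, -8), 7);
    return static_cast<std::uint8_t>(v + 8);
  };

  for (int i = 0; i < kBlockSize; i += 2)
    block.packed[i / 2] =
        static_cast<std::uint8_t>(to_nibble(w[i]) | (to_nibble(w[i + 1]) << 4));
  return block;
}
```

To use this format with an existing fp16/fp32 GEMM, the weights would have to be dequantized back to a wider type first, which gives up most of the speed benefit; getting the benefit requires a fused kernel (e.g. via CUTLASS) that unpacks the nibbles inside the matmul.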

Would it be reasonable to implement this as a CPU-only optimization? GGML supports this on CPU, but I'm not sure if that approach makes sense here or not.

jncraton avatar Jun 17 '23 21:06 jncraton
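Following up on the GGML comparison, here is a simplified sketch of what a CPU-only path could do: dequantize each block on the fly inside the dot product and accumulate in fp32. It reuses the hypothetical Int4Block layout from the sketch above; real GGML also quantizes the activations to int8 and uses SIMD, which is omitted here for clarity.

```cpp
// Simplified CPU-only dot product over the Int4Block layout sketched above:
// unpack each nibble, rescale, and accumulate against fp32 activations.
float dot_int4_fp32(const Int4Block* blocks, const float* x, int n) {
  float sum = 0.f;
  for (int b = 0; b < n / kBlockSize; ++b) {
    const Int4Block& blk = blocks[b];
    const float* xb = x + b * kBlockSize;
    for (int i = 0; i < kBlockSize; i += 2) {
      const std::uint8_t byte = blk.packed[i / 2];
      const float w0 = (static_cast<int>(byte & 0x0F) - 8) * blk.scale;
      const float w1 = (static_cast<int>(byte >> 4) - 8) * blk.scale;
      sum += w0 * xb[i] + w1 * xb[i + 1];
    }
  }
  return sum;
}
```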

Hi,

It would be great to have the possibility to integrate int4 quantization, given the very interesting results in terms of performance and inference speed!

Matthieu-Tinycoaching avatar Jun 22 '23 11:06 Matthieu-Tinycoaching

I see that the last few versions of OpenNMT-py have added support for 4-bit and other quantization methods. https://forum.opennmt.net/t/opennmt-py-v3-3-released-following-3-2-with-plenty-of-new-features/5366

Might any of that be integrated into CTranslate2?

nickchomey avatar Sep 14 '23 13:09 nickchomey

@guillaumekln Yes, 4-bit quantization (on CPU) is a much-needed feature. Any plans to take this up?

bil-ash avatar Apr 08 '24 01:04 bil-ash

Or maybe @ebraraktas can go one step further and implement 2-bit and 3-bit quantization by taking cues from https://github.com/intel/neural-speed/pull/178

bil-ash avatar Apr 08 '24 02:04 bil-ash