fxmarty
Hi @mszsorondo, looking into the PRs, BLIP has been implemented in https://github.com/huggingface/optimum/pull/1125. I just ticked it in the first post. @rajveer43 For Flava, there is this ongoing PR: https://github.com/huggingface/optimum/pull/907
Oh that's cool, thank you for sharing!
Marlin repacking kernel is integrated in https://github.com/AutoGPTQ/AutoGPTQ/pull/539, thank you @chu-tianxiang for the implementation!
@Qubitium I think `test_mixtral_generation` is flaky. `test_q4.py` is very slow for two reasons: it uses large models (7B, 13B), and more importantly some tests are on CPU only and...
Thank you, is there a way (i.e. non-private model) for me to reproduce & add a better test for this? What model architecture are you using? If llama, it could...
Yes, Yi uses the llama architecture. So likely something breaking in transformers :/
@Qubitium How long does the quantization take to reproduce on your private model? If you have time at hand, something you could try is to look at https://github.com/huggingface/transformers/commits/main/ between 1st...
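In case it helps, here is a rough sketch of the kind of check script I mean for that manual bisection. The model id and prompt are placeholders, not your private model; the idea is to reinstall transformers at a candidate commit, rerun, and compare the printed output across commits:

```python
# Rerun after installing transformers at each candidate commit, e.g.:
#   pip install "git+https://github.com/huggingface/transformers@<commit_sha>"
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-quantized-model"  # placeholder, swap in the model that regresses

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding so the output is deterministic across runs.
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(transformers.__version__)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```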
@gante do you see any obvious issue in the linked commit? Degraded generation is reported with Transformers 4.39 using `generate` with a model whose linear layers are replaced, compared to...
@Qubitium Do you see the difference before/after 4.39 also with a simple forward call, or only with `generate` calls? In your comparison, if you compare generations, could you make sure...
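For reference, a minimal sketch of the forward-only comparison I have in mind (model id is a placeholder): dump the logits in each environment and compare them offline, which takes sampling and generation config out of the picture entirely.

```python
# Run once with transformers 4.38.x and once with 4.39.x; each run saves a
# version-tagged dump that can be compared offline afterwards.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-quantized-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

torch.save(logits.cpu(), f"logits_{transformers.__version__}.pt")

# Offline, once both dumps exist:
# a = torch.load("logits_4.38.2.pt"); b = torch.load("logits_4.39.0.pt")
# print((a - b).abs().max())
```

If the forward logits match, the difference would be on the `generate` side, and using greedy decoding (`do_sample=False`) with the same generation config in both runs keeps that comparison deterministic.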
> In a worst-case scenario, can I use from_pretrained in my application?

Yes, it is fine to use just this. Unless you need to speed up inference, make things portable,...
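For completeness, and assuming the question is about plain transformers `from_pretrained` as opposed to the optimum/ONNX Runtime classes, the plain path looks like this (the model id is just an example):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Plain transformers usage, no optimum export involved.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tokenizer("Using from_pretrained directly works fine.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.argmax(-1))
```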