Raushan Turganbay

117 comments by Raushan Turganbay

Hi @andysingal. To use Kosmos-2 for image grounding, you have to add the special `<grounding>` token before the prompt, as they do in the [paper](https://arxiv.org/abs/2306.14824). Also you can use ``...
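For reference, here is a minimal sketch of grounded generation with Kosmos-2 in `transformers` (the checkpoint, example image, and generation settings are illustrative):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

ckpt = "microsoft/kosmos-2-patch14-224"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = Kosmos2ForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

# The `<grounding>` token before the prompt switches the model into grounding mode
prompt = "<grounding> An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Post-processing splits the output into a caption and the grounded entities
# (phrases with their bounding boxes)
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```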

As we discussed, the quantized cache can start being integrated into the library, given the results we got so far. All the possible speed optimizations/pre-fill stage optimizations can be...
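For context, a sketch of the user-facing shape such an integration could take — assuming a `cache_implementation="quantized"` flag on `generate` plus a small `cache_config` dict (names here illustrate the direction, not a settled API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # any decoder-only model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# Keep the KV cache in int4 instead of fp16 during decoding;
# `backend` picks the quantization library, `nbits` the bit width
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```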

Thanks for the comments!

> except for guarding quanto imports (also I would say safer to make local imports whenever possible - e.g. at QuantCache init)

Okay, noted!

> You...
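A minimal sketch of the guarded, local-import pattern being suggested (the class name is illustrative, not necessarily the PR's final one; `is_quanto_available` is the availability check in `transformers.utils`):

```python
from transformers.utils import is_quanto_available


class QuantizedCache:  # illustrative name
    """Guard the optional `quanto` dependency and import it locally."""

    def __init__(self, nbits: int = 4):
        # Fail with an actionable message if the backend is missing
        if not is_quanto_available():
            raise ImportError(
                "Using the quantized cache requires `quanto`: `pip install quanto`."
            )
        # Local import: `quanto` is only pulled in when the cache is instantiated,
        # so merely importing transformers never requires it
        import quanto  # noqa: F401

        self.nbits = nbits
```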

@gante added benchmark results to the PR description. Right now int4 has almost the same performance as fp16, sometimes a bit better. I also added a comparison with the KIVI paper.

I made the KV cache work with HQQ as a backend. It can simply be plugged in if a user writes their own `CacheClass`. I am not planning to add...
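Roughly, plugging in a custom backend means subclassing the cache and quantizing inside `update`. A sketch, with hypothetical `_quantize`/`_dequantize` helpers standing in for the actual HQQ calls (a real implementation would avoid re-quantizing the full past on every step):

```python
from typing import Any, Dict, List, Optional, Tuple

import torch
from transformers.cache_utils import Cache


class HQQCache(Cache):  # illustrative custom "CacheClass"
    """Keep past keys/values quantized; dequantize them on read."""

    def __init__(self, nbits: int = 4):
        super().__init__()
        self.nbits = nbits
        self._q_keys: List[Any] = []    # one quantized blob per layer
        self._q_values: List[Any] = []

    def _quantize(self, tensor: torch.Tensor) -> Any:
        return tensor  # placeholder for an HQQ quantize call

    def _dequantize(self, blob: Any) -> torch.Tensor:
        return blob  # placeholder for the inverse HQQ call

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        layer_idx: int,
        cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if layer_idx == len(self._q_keys):
            # First forward pass for this layer: nothing cached yet
            keys, values = key_states, value_states
            self._q_keys.append(self._quantize(keys))
            self._q_values.append(self._quantize(values))
        else:
            # Dequantize the stored past, append the new states, re-quantize
            keys = torch.cat([self._dequantize(self._q_keys[layer_idx]), key_states], dim=-2)
            values = torch.cat([self._dequantize(self._q_values[layer_idx]), value_states], dim=-2)
            self._q_keys[layer_idx] = self._quantize(keys)
            self._q_values[layer_idx] = self._quantize(values)
        return keys, values

    def get_seq_length(self, layer_idx: int = 0) -> int:
        if len(self._q_keys) <= layer_idx:
            return 0
        return self._dequantize(self._q_keys[layer_idx]).shape[-2]
```

The full-precision `keys`/`values` returned from `update` are what attention consumes; only the stored copy stays quantized.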

@gante

> This is with static cache AND compile, correct? Without compile it has no problems, correct? (I haven't seen them yet, if it happens without compile a reproduction example...

@gante as we discussed, I will not dig into the gibberish generation for fp32. In that case the PR should be ready to merge once we get the slow tests passing....

I think the cache problem should be fixed by converting `DynamicCache` back to legacy_cache in Idefics2's backbone language model, like it's already [done in llama](https://github.com/huggingface/transformers/blob/91d155ea92da372b319a79dd4eef69533ee15170/src/transformers/models/llama/modeling_llama.py#L1025-L1029). These changes are partially related...
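Simplified from the linked llama code, the conversion pattern looks like this inside the backbone's `forward` (`past_key_values`, `use_cache`, and `next_decoder_cache` come from the surrounding method):

```python
from transformers.cache_utils import Cache, DynamicCache

# On the way in: accept either a `Cache` object or the legacy tuple-of-tuples
use_legacy_cache = not isinstance(past_key_values, Cache)
if use_legacy_cache:
    past_key_values = DynamicCache.from_legacy_cache(past_key_values)

# ... decoder layers run here, producing `next_decoder_cache` ...

# On the way out: hand back the legacy format when that's what the caller used
next_cache = None
if use_cache:
    next_cache = (
        next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
    )
```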

@gante and I discussed the cache input-output format yesterday. Maybe a llama-format cache is not what we need, but anyway @gante will take care of it 😄

@amyeroberts I am not sure what the correct format is for the cache objects we return from language models, since right now we do not have consistency, so I wanted @gante...