OlivierDehaene

119 comments of OlivierDehaene

It is possible that this is not the root cause, but there is an issue with these lines:

```python
offset = 0
if has_layer_past:
    offset = layer_past[0].shape[-2]
seq_len += offset
```
...
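For context, here is a minimal annotated sketch of how that offset is usually meant to behave with a KV cache (the names follow the snippet above; the helper and its surrounding attention code are assumptions for illustration, not code from the issue):

```python
# Illustrative sketch of standard KV-cache bookkeeping, not the upstream fix.
# layer_past[0] is assumed to hold cached keys of shape
# (batch, num_heads, past_len, head_dim).
def total_seq_len(seq_len, layer_past, has_layer_past):
    offset = 0
    if has_layer_past:
        offset = layer_past[0].shape[-2]  # positions already in the cache
    # New tokens occupy positions [offset, offset + seq_len), so rotary
    # embeddings must be indexed from `offset`, not 0. With left padding,
    # `offset` also counts pad tokens, which can shift positions incorrectly.
    return seq_len + offset
```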

We use left padding extensively on the serving side, as we have dynamic batching logic that batches sequences of very different lengths together. While the pad==256 example above seems...
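To make that concrete, here is a hedged sketch of left-padded batching (illustrative only, not the text-generation-inference implementation; the helper name and pad id are assumptions):

```python
import torch

def left_pad(sequences: list[list[int]], pad_id: int = 0):
    """Pad on the left so the last tokens of all sequences line up."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
    for i, seq in enumerate(sequences):
        input_ids[i, max_len - len(seq):] = torch.tensor(seq)
        attention_mask[i, max_len - len(seq):] = 1
    # Position ids must skip the padding; otherwise position embeddings are
    # computed for pad tokens and the real tokens end up shifted.
    position_ids = attention_mask.cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 0)
    return input_ids, attention_mask, position_ids
```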

Hi @njhill! Nice, thanks for working on this! For now I have a fix on my text-generation-inference fork, as we have multiple NeoX models in prod and I need a fix...

Do you happen to have an AMD CPU?

@LLukas22 can you share more on this? The BERT CPU impl is almost exactly the same as the one in Candle Transformers. This might only be linked to the default...

Oh, that's expected given your gist: TEI does not batch on CPU (yet). That's a different issue altogether. Here the main problem is that MKL's sgemm is slower than whatever...
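As a rough way to sanity-check single-precision GEMM throughput on a given CPU, something like the following works (a hedged sketch assuming a NumPy build linked against MKL; the matrix size is arbitrary and this is not the benchmark used here):

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
a @ b  # warm-up so lazy initialization doesn't skew the timing

runs = 10
start = time.perf_counter()
for _ in range(runs):
    a @ b
elapsed = (time.perf_counter() - start) / runs

flops = 2 * n**3  # multiply-adds in an n x n x n matmul
print(f"sgemm: {flops / elapsed / 1e9:.1f} GFLOP/s")
```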

#35 helps on AMD CPUs (20% faster on average), but it shouldn't really make a difference on Intel ones besides making it clear to MKL that we want to use...
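The truncated sentence doesn't show the exact settings, so purely as an illustration, these are the standard MKL/OpenMP threading knobs one might pin explicitly (an assumed example, not the contents of #35):

```python
import os

# These must be set before the MKL-backed library is first loaded.
# Values here are illustrative assumptions, not taken from the PR.
os.environ.setdefault("MKL_NUM_THREADS", str(os.cpu_count()))
os.environ.setdefault("OMP_NUM_THREADS", str(os.cpu_count()))
os.environ.setdefault("MKL_DYNAMIC", "FALSE")  # keep the thread pool fixed
```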

What exactly do you need for it to be supported? Is supporting per-token embeddings with compression enough?

This is definitely on our roadmap and will be tackled in the coming weeks. Here are the priorities right now:

1. Rewrite the scheduling code and cache multi-turn conversations. This...

It's possible that you are just missing the http:// scheme: `OTLP_ENDPOINT: http://tempo.monitoring:4317`. The traces you see are from the Python server, but it doesn't seem to collect the traces from...
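To illustrate why the scheme matters, here is a minimal OTLP trace-export setup in Python (a hedged sketch using the opentelemetry packages; it only mirrors the endpoint format and is not the project's actual tracing code):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Include the scheme explicitly, matching the endpoint suggested above.
exporter = OTLPSpanExporter(endpoint="http://tempo.monitoring:4317")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```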