ExtReMLapin

Results: 239 comments of ExtReMLapin

PR #19084 fixes this issue. When working with contexts of 70k, with the model loaded plus the context it uses something like 30 GB of VRAM, but during inference it goes...

I really don't get why no one has opened a PR yet to replace the placeholders with the actual description/OCR

I wrote a script that removes some images from wandb:

```python
import wandb

api = wandb.Api()
entity = "your_wandb_username_or_organization"
projects = api.projects(entity)

def is_image_from_name(name):
    # Treat any PNG/JPEG file as an image.
    return name.endswith(".png") or name.endswith(".jpg") or name.endswith(".jpeg")
```
...
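The excerpt cuts off before the cleanup loop. A minimal sketch of how the rest might look, assuming only the public wandb client API (`api.runs()`, `run.files()`, `File.delete()`) and the `is_image_from_name` helper above:

```python
# Hypothetical continuation, not the original script: walk every run in
# every project and delete files whose names look like images.
for project in projects:
    for run in api.runs(f"{entity}/{project.name}"):
        for file in run.files():
            if is_image_from_name(file.name):
                print(f"deleting {file.name} from run {run.id}")
                file.delete()
```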

What is the performance when comparing a Python transformers run to a llama.cpp run?

Thanks for the answer. Unless there is a typo somewhere, I expected it to be faster on llama.cpp
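For reference, one rough way to time the transformers side of such a comparison (a minimal sketch; the model name and prompt are placeholders, and `device_map="auto"` assumes accelerate is installed):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# Greedy decode a fixed number of tokens and report throughput.
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```

On the llama.cpp side, llama-bench reports tokens/s directly, which makes the comparison straightforward.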

Any news on this? Also, what @viantirreau suggested would be top notch.

Average 'I can't code but I want to be in the contributors list' PR

Not stale. I fixed this issue in llama.cpp, but vllm has the same issue: https://github.com/vllm-project/vllm/blob/4fbd8bb597cf392b94def04a6955f22580356d76/vllm/entrypoints/openai/protocol.py#L712C9-L712C35 It's generating a JSON schema without allowing for the thinking tags. Llama.cpp issue: https://github.com/ggml-org/llama.cpp/issues/15247...
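To illustrate the constrained-decoding problem: if the schema-derived grammar applies from the very first token, a reasoning model can never emit its `<think>...</think>` prefix. A minimal sketch of the idea behind the fix, using regexes as stand-ins for the compiled grammar (not vllm's or llama.cpp's actual code):

```python
import re

# Simplified stand-in for the schema-derived constraint; real backends
# compile the JSON schema into a grammar, not a regex.
json_body = r"\{.*\}"

# Constrained from token 0: a thinking prefix is impossible.
strict = re.compile(json_body, re.DOTALL)

# Fixed form: allow an optional thinking block before the JSON payload.
with_thinking = re.compile(rf"(?:<think>.*?</think>\s*)?{json_body}", re.DOTALL)

sample = '<think>plan the answer</think> {"a": 1}'
print(bool(strict.fullmatch(sample)))             # False: prefix rejected
print(bool(with_thinking.fullmatch(sample)))      # True
print(bool(with_thinking.fullmatch('{"a": 1}')))  # True: thinking is optional
```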

It's ZLUDA's job to support CTranslate2.