AlphaAtlas


Hmmm, does CLBlast reduce generation speed on IGPs now? I would think the transfers would be fine over a single PCIe bus to a single IGP.

@ggerganov llama-cpp-python (which text-generation-webui uses) implements the additional caching: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L865
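
For reference, a minimal sketch of enabling that cache from the llama-cpp-python side, as I understand the API (class and method names may differ between versions, and the model path is a placeholder):

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/7B/model.gguf")  # placeholder path

# Keep evaluated KV states in RAM so a follow-up prompt that shares a
# prefix with an earlier one skips re-evaluating the shared tokens.
llm.set_cache(LlamaCache())

first = llm("USER: Hello!\nASSISTANT:", max_tokens=64)
# This call should reuse the cached state for the shared prefix instead
# of reprocessing the whole conversation from scratch.
second = llm(
    "USER: Hello!\nASSISTANT: Hi there!\nUSER: How are you?\nASSISTANT:",
    max_tokens=64,
)
```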

If y'all progressively get rid of BLAS libraries, cuBLAS is probably lowest on the totem pole? AFAIK users still need the huge CUDA toolkit to run CUDA inference anyway, so...

This is normal for chats. Processing gets slower and VRAM usage goes up as the context size (aka your chat history) grows. Also, the CPU/GPU tend to run at turbo...
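
To put rough numbers on the VRAM growth: the KV cache scales linearly with context length. A back-of-envelope sketch, assuming a LLaMA-7B-shaped model (32 layers, 4096-wide embeddings) with an fp16 cache; the constants are illustrative, not measured from anyone's setup:

```python
# Why VRAM climbs with chat history: each token in context keeps a
# key and a value vector per layer in the KV cache.

def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                   bytes_per_elem: int = 2) -> int:
    """Size of the K and V tensors for n_ctx tokens (fp16 assumed)."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem  # 2 = K + V

for n_ctx in (512, 1024, 2048):
    print(f"{n_ctx:>5} tokens -> {kv_cache_bytes(n_ctx) / 2**30:.2f} GiB")
# 512 tokens -> 0.25 GiB
# 1024 tokens -> 0.50 GiB
# 2048 tokens -> 1.00 GiB
```

That linear memory growth, plus attention work that grows with the number of cached tokens, is why a long chat feels heavier than a fresh one.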