Eric Buehler

543 comments by Eric Buehler

@TimDouglas2 tagging to let you know I am closing this.

This PR is based on the following reference ggml quantization/dequantization functions:

- Dequantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L2434-L2475
- Quantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L4562-L4745
- Vec dot: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-cpu/ggml-cpu-quants.c#L11670-L12233
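For readers who haven't looked at ggml's block-quantized layouts, here is a minimal sketch in Rust of the pattern those reference functions follow, assuming a Q8_0-style block (32 int8 weights sharing one f16 scale). The exact formats ported in this PR follow the linked ggml sources; the names `BlockQ8_0`, `quantize_block`, and `dequantize_block` here are illustrative only.

```rust
use half::f16;

/// Illustrative Q8_0-style block: 32 quantized weights sharing one scale.
/// (A sketch only; the PR follows the linked ggml code exactly.)
const QK8_0: usize = 32;

#[repr(C)]
struct BlockQ8_0 {
    d: f16,           // per-block scale
    qs: [i8; QK8_0],  // quantized values
}

/// Quantize one block: d = max(|x|) / 127, then round x / d to i8,
/// mirroring ggml's reference quantization pattern.
fn quantize_block(x: &[f32; QK8_0]) -> BlockQ8_0 {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; QK8_0];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * id).round() as i8;
    }
    BlockQ8_0 { d: f16::from_f32(d), qs }
}

/// Dequantize one block: y[i] = d * qs[i].
fn dequantize_block(block: &BlockQ8_0, out: &mut [f32; QK8_0]) {
    let d = block.d.to_f32();
    for (y, &q) in out.iter_mut().zip(block.qs.iter()) {
        *y = d * q as f32;
    }
}
```

The vec-dot kernels linked above follow the same idea: accumulate integer dot products per block, then apply the per-block scales, which is why quantization and the dot product are ported together.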

@Murad-Awad our SDPA implementation is currently specialized for Metal, and only for the decode phase, where there is no masking. For CUDA, the equivalent would most likely be to use...
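A minimal sketch of what such a guarded dispatch could look like in candle-style Rust. The dispatch conditions (Metal device, single query token, no mask) come from the comment above; `sdpa_general` and the dispatch function itself are hypothetical names, not the actual implementation.

```rust
use candle_core::{Device, Result, Tensor, D};
use candle_nn::ops::softmax;

/// General SDPA path: softmax(q k^T / sqrt(d)) v, with an optional additive mask.
fn sdpa_general(q: &Tensor, k: &Tensor, v: &Tensor, mask: Option<&Tensor>) -> Result<Tensor> {
    let scale = 1.0 / (q.dim(D::Minus1)? as f64).sqrt();
    let mut att = (q.matmul(&k.transpose(D::Minus2, D::Minus1)?)? * scale)?;
    if let Some(mask) = mask {
        att = att.broadcast_add(mask)?;
    }
    softmax(&att, D::Minus1)?.matmul(v)
}

/// Hypothetical dispatch: only take the specialized path on Metal, during
/// decode (a single query token), and when there is no mask.
fn sdpa(q: &Tensor, k: &Tensor, v: &Tensor, mask: Option<&Tensor>) -> Result<Tensor> {
    let is_decode = q.dim(2)? == 1; // layout assumed (b, heads, seq, head_dim)
    if matches!(q.device(), Device::Metal(_)) && is_decode && mask.is_none() {
        // The real code would call the fused Metal SDPA kernel here;
        // falling back to the general path keeps this sketch runnable.
        sdpa_general(q, k, v, None)
    } else {
        sdpa_general(q, k, v, mask)
    }
}
```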

Hi @polarathene thanks for the analysis, much appreciated 🫡! I will trigger a new build for v0.6.0 when I release it. I plan for that to be over the weekend,...

Hey @Murad-Awad! We have [candle-extensions](https://github.com/huggingface/candle-extensions) now, and you can use the [candle-flash-attn-v1](https://crates.io/crates/candle-flash-attn-v1) crate. The function is a [1:1 drop-in replacement](https://github.com/huggingface/candle-extensions/blob/612d5191f57bdc5b9a77659bc5834853793dc9fd/candle-flash-attn-v1/src/lib.rs#L252) for the v2 implementation here in Candle. Let me know...
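Assuming the v1 crate keeps the same `flash_attn(q, k, v, softmax_scale, causal)` signature as candle's v2 implementation (which the 1:1 drop-in claim implies), swapping it in might look like this sketch:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    // flash-attn expects (batch, seq_len, num_heads, head_dim) in f16/bf16.
    let q = Tensor::randn(0f32, 1.0, (1, 128, 32, 64), &device)?.to_dtype(DType::F16)?;
    let k = q.clone();
    let v = q.clone();
    let softmax_scale = 1.0 / (64f32).sqrt();
    // Drop-in swap: same call shape as candle_flash_attn::flash_attn in candle.
    let out = candle_flash_attn_v1::flash_attn(&q, &k, &v, softmax_scale, /* causal */ true)?;
    println!("{:?}", out.shape());
    Ok(())
}
```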

Interesting find: F16 fails (produces NaN) on an A100, but not an H100.

@coreylowman sorry for not getting back! I am running this on my GPU and PyTorch can see it (`torch.cuda.is_available() == True`).

I am using `cuda-version-from-build-system` and `dynamic-linking`. How should I try dynamic loading?

Hmm yeah, same error. Current:

```toml
cudarc = { version = "0.11.5", features = ["std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16", "cuda-12020"], default-features = false }
```
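For the dynamic-loading question above, a sketch of the Cargo.toml change, assuming cudarc's `dynamic-loading` feature is the counterpart to `dynamic-linking`:

```toml
# Hypothetical change: swap dynamic-linking for dynamic-loading, so the
# CUDA libraries are dlopen'd at runtime instead of linked at build time.
cudarc = { version = "0.11.5", default-features = false, features = [
    "std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16",
    "cuda-12020", "dynamic-loading",
] }
```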

Looks exciting! I wonder how complicated it would be to build this on top of Conv2d?