Add GGUF BF16 dtype support
Currently, GgmlDType includes F16 but not BF16. This PR introduces support for the BF16 type.
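For context, the core of the change is threading a new variant through the dtype plumbing. Here is a standalone sketch that mirrors the shape of the change (not the actual candle source; match arms for the quantized variants are elided). The GGUF type id of 30 for BF16 follows ggml's type enum, and the element size is 2 bytes with no blocking, matching F16:

```rust
// Standalone sketch mirroring the shape of the change (not the actual
// candle source): a BF16 variant plus its GGUF plumbing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlDType {
    F32,
    F16,
    BF16, // new in this PR
    // ... quantized variants (Q4_0, Q8_0, Q4K, ...) elided here ...
}

impl GgmlDType {
    // 30 is GGML_TYPE_BF16 in ggml's type enum, which GGUF reuses.
    fn from_u32(u: u32) -> Result<Self, String> {
        match u {
            0 => Ok(Self::F32),
            1 => Ok(Self::F16),
            30 => Ok(Self::BF16),
            _ => Err(format!("unknown gguf dtype {u}")),
        }
    }

    // Bytes per element: BF16 is 2 bytes, the same as F16.
    fn type_size(&self) -> usize {
        match self {
            Self::F32 => 4,
            Self::F16 | Self::BF16 => 2,
        }
    }
}

fn main() {
    assert_eq!(GgmlDType::from_u32(30), Ok(GgmlDType::BF16));
    assert_eq!(GgmlDType::BF16.type_size(), 2);
}
```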
I would appreciate a check that this looks good! I have tested it successfully on my machine, which has AVX and F16C, and the CUDA tests also pass even though no CUDA-specific changes were necessary.
One thing I noticed, though: there will be a confusing situation if the tensor is part of a QMatMul. In that case (and for all other types without quantized matmul support in QStorage), we should perhaps dequantize and then perform the matmul using cuBLAS? This modification could be made in QStorage::fwd, perhaps; a rough sketch of the idea follows.
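To illustrate the fallback I have in mind, here is a hedged sketch against candle's public API. `matmul_or_dequantize` is a hypothetical helper, not something in the crate, and the actual change would live inside the quantized matmul path rather than as a free function; the point is just the dispatch: dtypes without a quantized kernel get dequantized and go through a plain matmul, which dispatches to cuBLAS on CUDA devices.

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Module, Result, Tensor};
use std::sync::Arc;

/// Hypothetical helper sketching the proposed fallback: use the
/// quantized matmul kernel when one exists for the weight's dtype,
/// otherwise dequantize and run a regular matmul (cuBLAS on CUDA).
fn matmul_or_dequantize(xs: &Tensor, w: &Arc<QTensor>) -> Result<Tensor> {
    match w.dtype() {
        // No quantized matmul kernel for BF16: dequantize first.
        GgmlDType::BF16 => {
            let w = w.dequantize(xs.device())?;
            // The weight is stored as (out, in), so transpose before
            // the (b, in) x (in, out) matmul; 2D case shown here.
            xs.matmul(&w.t()?)
        }
        // Everything else goes through the existing quantized kernels.
        _ => QMatMul::from_arc(w.clone())?.forward(xs),
    }
}
```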