Add GGUF BF16 dtype support
Currently, GgmlDType includes F16 but not BF16. This PR introduces support for the BF16 type.
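For context, the core of the change is threading a new variant through the dtype plumbing. Here is a standalone sketch that mirrors the shape of the change (not the actual candle source; match arms for the quantized variants are elided). The GGUF type id of 30 for BF16 follows ggml's type enum, and the element size is 2 bytes with no blocking, matching F16:

```rust
// Standalone sketch mirroring the shape of the change (not the actual
// candle source): a BF16 variant plus its GGUF plumbing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GgmlDType {
    F32,
    F16,
    BF16, // new in this PR
    // ... quantized variants (Q4_0, Q8_0, Q4K, ...) elided here ...
}

impl GgmlDType {
    // 30 is GGML_TYPE_BF16 in ggml's type enum, which GGUF reuses.
    fn from_u32(u: u32) -> Result<Self, String> {
        match u {
            0 => Ok(Self::F32),
            1 => Ok(Self::F16),
            30 => Ok(Self::BF16),
            _ => Err(format!("unknown gguf dtype {u}")),
        }
    }

    // Bytes per element: BF16 is 2 bytes, the same as F16.
    fn type_size(&self) -> usize {
        match self {
            Self::F32 => 4,
            Self::F16 | Self::BF16 => 2,
        }
    }
}

fn main() {
    assert_eq!(GgmlDType::from_u32(30), Ok(GgmlDType::BF16));
    assert_eq!(GgmlDType::BF16.type_size(), 2);
}
```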
I would appreciate a check that this looks good! I have tested it successfully on my machine, which has AVX and F16C, and the CUDA tests also pass even though no CUDA-specific changes were necessary.
One thing I noticed, though: there will be a confusing situation if the tensor is part of a QMatMul. In that case (and for all other types without quantized matmul support in QStorage), we should perhaps dequantize and then perform the matmul using cuBLAS? This modification could be made in QStorage::fwd, perhaps; a rough sketch of the idea follows.
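To illustrate the fallback I have in mind, here is a hedged sketch against candle's public API. `matmul_or_dequantize` is a hypothetical helper, not something in the crate, and the actual change would live inside the quantized matmul path rather than as a free function; the point is just the dispatch: dtypes without a quantized kernel get dequantized and go through a plain matmul, which dispatches to cuBLAS on CUDA devices.

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Module, Result, Tensor};
use std::sync::Arc;

/// Hypothetical helper sketching the proposed fallback: use the
/// quantized matmul kernel when one exists for the weight's dtype,
/// otherwise dequantize and run a regular matmul (cuBLAS on CUDA).
fn matmul_or_dequantize(xs: &Tensor, w: &Arc<QTensor>) -> Result<Tensor> {
    match w.dtype() {
        // No quantized matmul kernel for BF16: dequantize first.
        GgmlDType::BF16 => {
            let w = w.dequantize(xs.device())?;
            // The weight is stored as (out, in), so transpose before
            // the (b, in) x (in, out) matmul; 2D case shown here.
            xs.matmul(&w.t()?)
        }
        // Everything else goes through the existing quantized kernels.
        _ => QMatMul::from_arc(w.clone())?.forward(xs),
    }
}
```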