flux-fp8-api

The possibility of supporting GPUs with other architectures

Open ziyaxuanyi opened this issue 1 year ago • 1 comment

Could support be extended to GPUs with other architectures, such as the RTX 3090? I tested on a 3090 and found that FP8 quantization not only fails to accelerate the model, it actually slows inference down significantly.

ziyaxuanyi avatar Nov 11 '24 03:11 ziyaxuanyi

Well, fp8 matmul is only possible on Ada devices, since those have CUDA tensor-core instructions for performing matrix multiplication directly on fp8 tensors. Without an Ada device, the only option is to dequantize the tensor to float16, bfloat16, or float32 and then do the matrix multiplication in that dtype, which is of course significantly slower than a direct matmul on fp8 tensors. For a 3090, that is the only way to use a float8 tensor.

aredden avatar Nov 14 '24 17:11 aredden
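
A minimal sketch of the fallback path described above, assuming PyTorch ≥ 2.1 with `float8_e4m3fn` support; the function name, shapes, and per-tensor scale below are illustrative and not the repo's actual code:

```python
import torch

# What a non-Ada GPU (e.g. a 3090) has to do with an fp8 weight:
# dequantize to bfloat16 first, then run an ordinary matmul.
# On Ada (SM 8.9+) the multiply can instead run directly on the fp8
# tensors via the hardware fp8 tensor cores, skipping this extra step.
def fp8_linear_fallback(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    w = w_fp8.to(torch.bfloat16) * w_scale   # dequantize: extra kernel launch + memory traffic
    return x @ w.t()                          # standard bf16 matmul

# Illustrative usage (shapes and scale are made up)
x = torch.randn(4, 64, dtype=torch.bfloat16)
w = torch.randn(128, 64).to(torch.float8_e4m3fn)   # fp8-quantized weight
scale = torch.tensor(0.05, dtype=torch.bfloat16)   # per-tensor dequantization scale
y = fp8_linear_fallback(x, w, scale)
print(y.shape)  # torch.Size([4, 128])
```

The dequantize step adds both an extra kernel and extra memory traffic on every forward pass, which is why the 3090 ends up slower with fp8 weights than it would be with plain bf16/fp16 weights.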