Optimize quantization process with QTensor::quantize_onto

Open EricLBuehler opened this issue 1 year ago • 0 comments

Motivation:

The current QTensor::quantize places the quantized result on the same device as src. This is fine for most use cases, but it is problematic whenever src is not on the CPU, because quantization is only supported on the CPU.

To implement quantization on a non-CPU device, we do the following:

1. Trigger a synchronizing dtoh copy here (same for Metal): https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L436-L441
2. Quantize on the CPU.
3. Trigger a synchronizing htod copy here (same for Metal): https://github.com/huggingface/candle/blob/main/candle-core/src/quantized/cuda.rs#L447

Because of the 2 copies and the fact that we are synchronizing the CUDA device (I'm not sure about the semantics for Metal, but we are certainly copying the data), this hurts performance!

The solution is a small modification: a new API, QTensor::quantize_onto, that takes a CPU tensor, quantizes it on the CPU, and then performs a single synchronizing htod copy. This halves the number of data transfers and synchronizations.
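To make the transfer counts concrete, here is a minimal toy model of the two flows. It does not use candle: `Device`, `Tensor`, `QTensor`, `Transfers`, and `quantize_cpu` are hypothetical stand-ins, the 8-bit scheme is a placeholder for the real GGML formats, and the `quantize_onto` signature is an assumption, since the issue does not spell it out. Copies are only counted, not performed.

```rust
// Toy model only: every type and function here is a hypothetical stand-in,
// not candle's actual API.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Device {
    Cpu,
    Gpu, // stands for any non-CPU device (CUDA or Metal)
}

/// Counts the synchronizing copies a call would trigger.
#[derive(Debug, Default, PartialEq)]
struct Transfers {
    dtoh: usize, // device-to-host copies
    htod: usize, // host-to-device copies
}

struct Tensor {
    data: Vec<f32>,
    device: Device,
}

struct QTensor {
    scale: f32,
    data: Vec<i8>,
    device: Device,
}

/// Placeholder for the CPU-only quantization kernel:
/// symmetric 8-bit quantization with a single scale.
fn quantize_cpu(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let data = values.iter().map(|v| (v / scale).round() as i8).collect();
    (scale, data)
}

/// Current flow: the result lands on the same device as `src`.
/// A non-CPU `src` costs one dtoh copy and one htod copy.
fn quantize(src: &Tensor, t: &mut Transfers) -> QTensor {
    if src.device != Device::Cpu {
        t.dtoh += 1; // bring the data to the host
    }
    let (scale, data) = quantize_cpu(&src.data);
    if src.device != Device::Cpu {
        t.htod += 1; // send the quantized data back
    }
    QTensor { scale, data, device: src.device }
}

/// Proposed flow (assumed signature): `src` must already be on the CPU,
/// and the result is placed on `dev` with at most one htod copy.
fn quantize_onto(src: &Tensor, dev: Device, t: &mut Transfers) -> Result<QTensor, String> {
    if src.device != Device::Cpu {
        return Err("quantize_onto expects a CPU source tensor".to_string());
    }
    let (scale, data) = quantize_cpu(&src.data);
    if dev != Device::Cpu {
        t.htod += 1; // the only copy in this path
    }
    Ok(QTensor { scale, data, device: dev })
}
```

In this model, quantizing a GPU-resident tensor with `quantize` records one dtoh and one htod copy, while `quantize_onto` with a CPU source and a GPU target records a single htod copy and yields the same quantized values.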

Aug 10 '24 00:08 EricLBuehler
