
Quantization issue - Mixtral 8x22b

Open edesalve opened this issue 1 year ago • 1 comment

Hi all. I'm currently working on the implementation of a quantized version of Mixtral 8x22b. I'm using the weights from the following repo: MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF.

Unfortunately, the expert-related tensors of each layer are merged into a single tensor (instead of there being one tensor per expert). In order to use the available Mixtral implementation and exploit expert routing, I:

  • dequantize the QTensor;
  • chunk the resulting tensor along dimension 0 and squeeze;
  • quantize again to resume the conventional flow (see the sketch below).
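
For reference, here is a minimal sketch of that flow. The helper name split_experts, the Q4K target dtype, and the hard-coded CPU device are illustrative assumptions, not code from candle or from the original files:

    use candle_core::quantized::{GgmlDType, QTensor};
    use candle_core::{Device, Result, Tensor};

    // Hypothetical helper: split a merged expert tensor into one QTensor
    // per expert by dequantizing, chunking along dim 0, and re-quantizing.
    fn split_experts(merged: &QTensor, n_experts: usize) -> Result<Vec<QTensor>> {
        // 1. Dequantize the merged tensor back to f32.
        let full: Tensor = merged.dequantize(&Device::Cpu)?;
        // 2. Chunk along dimension 0, one chunk per expert, and squeeze.
        // 3. Re-quantize each chunk; this is the step that fails as
        //    described below.
        full.chunk(n_experts, 0)?
            .into_iter()
            .map(|chunk| QTensor::quantize(&chunk.squeeze(0)?, GgmlDType::Q4K))
            .collect()
    }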

The problem arises during the quantization step. Consider quantize in the QTensor implementation:


    pub fn quantize(src: &Tensor, dtype: GgmlDType) -> Result<Self> {
        let shape = src.shape();
        let block_size = dtype.block_size();
        check_shape(shape, block_size)?;
        let src = src.to_dtype(crate::DType::F32)?.flatten_all()?;
        let elem_count = shape.elem_count();
        if elem_count % block_size != 0 {
            crate::bail!(
                "tensor size ({shape:?}) is not divisible by block size {}",
                block_size
            )
        }
        let mut storage = src.device().qzeros(elem_count, dtype)?;
        storage.quantize(&src.storage())?;
        Ok(Self {
            storage,
            shape: shape.clone(),
        })
    }

and quantize in the QStorage implementation:


    fn quantize(&mut self, src: &Storage) -> Result<()> {
        match (self, src) {
            (QStorage::Cpu(storage), Storage::Cpu(src)) => {
                storage.from_float(src.as_slice::<f32>()?)?;
            }
            (QStorage::Metal(storage), Storage::Metal(src)) => storage.quantize(src)?,
            (QStorage::Cuda(storage), Storage::Cuda(src)) => storage.quantize(src)?,
            _ => crate::bail!("Invalid dequantize storage locations do not match"),
        }
        Ok(())
    }

I found that the length of src.as_slice::<f32>() is 8 times elem_count, which makes the subsequent checks fail. Any suggestions as to why this happens? Does the chunk operation have any influence?

edesalve avatar May 21 '24 16:05 edesalve

Interesting, thanks for reporting this. It's a bit sad that there are multiple different gguf conventions for mixtral; e.g. the files from TheBloke seem to have one tensor per expert, which should make things a lot easier to use. If you really want to use these other files, the easiest route would likely be to slice the quantized tensor directly, though this would require a couple of changes on the candle side; not sure if you've given this a try. The dequantize-chunk-quantize approach should work: the storage size is indeed 8 times elem_count because the chunks are just views over the original tensor. You can trigger an actual copy by calling copy on the tensor; the storage will then be adjusted to the proper size, but this has some cost for copying the elements over.
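
To illustrate the view behaviour described above, here is a small sketch (the shapes and the view_vs_copy name are made up for illustration):

    use candle_core::{DType, Device, Result, Tensor};

    fn view_vs_copy() -> Result<()> {
        // chunk() returns views: each chunk shares the parent's storage.
        let full = Tensor::zeros((8, 64, 64), DType::F32, &Device::Cpu)?;
        let chunk = full.chunk(8, 0)?.remove(0).squeeze(0)?; // shape [64, 64]
        assert_eq!(chunk.elem_count(), 64 * 64);
        // The storage behind `chunk` still holds all 8 * 64 * 64 values,
        // which is the 8x mismatch seen in quantize above.
        // copy() materializes fresh storage sized to the view's own shape.
        let _owned = chunk.copy()?;
        Ok(())
    }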

LaurentMazare avatar May 21 '24 19:05 LaurentMazare

Yes, I used the TheBloke weights for Mixtral 8x7b. Moreover, the files from MaziyarPanahi are split with the llama.cpp utility gguf-split, so loading them via hf_hub is not viable (the split of the TheBloke weights is a simple division of bytes, whereas MaziyarPanahi used gguf-split to split the weights into 2 gguf files, each with its own metadata).

Concerning the quantization issue, a simple call to copy wasn't enough; I solved it with force_contiguous.
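
In case it helps anyone else, here is a sketch of the working variant, reusing full and n_experts from the earlier sketch (the Q4K target dtype is again an illustrative choice):

    let experts = full
        .chunk(n_experts, 0)?
        .into_iter()
        .map(|chunk| {
            // force_contiguous() always makes a copy, so the chunk's
            // storage now matches its shape and QTensor::quantize sees
            // the expected number of elements.
            let owned = chunk.squeeze(0)?.force_contiguous()?;
            QTensor::quantize(&owned, GgmlDType::Q4K)
        })
        .collect::<Result<Vec<QTensor>>>()?;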

I have been wanting to work on the QTensor and QStorage API for a while (e.g., completing the implementation of data for QStorage, and providing implementations of to_device and other classic operations such as narrow), since I make heavy use of these tensors split into shards.

I hope to find the time to work on it and create a PR.

edesalve avatar May 25 '24 18:05 edesalve