Denis Mazur

30 comments by Denis Mazur

Hey, @h9-tect, the notebook you pushed appears to be running out of memory. Is that still the case?

Hi! Sorry for the late reply. Running the model on multiple GPUs is not currently supported: all active experts are sent to cuda:0. You can send an expert to a...
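
As a generic illustration of the mechanism (the real experts in this repo sit behind the offloading cache, so the module and device names below are placeholders, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a single expert's projection layer; in the actual
# repo, experts are managed by the offloading cache rather than plain modules.
expert = nn.Linear(4096, 14336)

# Generic PyTorch device placement: move this one expert to a second GPU,
# falling back to CPU if only one GPU is available.
target = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"
expert = expert.to(target)
```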

By the way, one of our quantization setups compressed the model to 17 GB. That would fit into the VRAM of two T4 GPUs (16 GB each), which you can get for free on...

> May I ask which quantization setup allowed compression down to 17Gb, or if you could point me to a file that contains that setup please?

It's the 4-bit attention and...
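
Roughly, a mixed HQQ setup along those lines can be expressed with separate configs for the attention and expert layers. The bit widths and group sizes below are illustrative assumptions, not necessarily the exact recipe behind the 17 GB figure:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Illustrative mixed-precision setup (values are assumptions):
# keep attention layers at 4 bits, compress the expert MLPs harder.
attn_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
ffn_config  = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)
```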

> the model seems to only occupy ~11Gb on a single GPU without an OOM error, but then at inference there's no utilization of the GPU cores throughout (though the...

> Absolutely, what information are you looking for?

A stacktrace would be helpful.

Hi! Full fine-tuning won't work since the model is quantized, but you could try fine-tuning it with PEFT techniques that support quantized base models. Check out [QLoRA](https://github.com/artidoro/qlora)...
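
A rough sketch of the QLoRA-style recipe (the target module names and hyperparameters below are illustrative, and plain `peft` may need tweaks for this repo's custom offloaded layers):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical: stands in for the quantized/offloaded Mixtral built elsewhere;
# the checkpoint id here only illustrates the call shape.
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```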

Hey, @nmarafo and @complete-dope! It looks like using Hugging Face's `peft` to fine-tune the offloaded model is a bit tricky (mostly because of the custom layers), but I haven't looked into it...

I'm not sure whether `(module.meta['shape'][1], module.meta['shape'][0])` is the correct shape. Maybe you should try pulling the correct shape from the [original model's config](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).

```python
from transformers import AutoConfig

config = ...
```
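
If it helps, a hedged sketch of that lookup, assuming the shape in question is one of the expert MLP projections (whether you need `(intermediate_size, hidden_size)` or its transpose depends on which projection the module actually holds):

```python
from transformers import AutoConfig

# Read the reference dimensions straight from the official Mixtral config.
config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# The experts' w1/w3 projection weights are (intermediate_size, hidden_size)
# in the original checkpoint; w2 is the transpose. Compare against
# module.meta['shape'] instead of guessing by swapping its entries.
expected_shape = (config.intermediate_size, config.hidden_size)
print(expected_shape)  # (14336, 4096) for Mixtral-8x7B-v0.1
```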

Hey! We are currently looking into other quantization approaches, both to improve inference speed and LM quality. How good is exl2's 2.4-bit quantization? 2.4 bits per parameter sounds like it...
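
For a rough sense of scale, taking Mixtral-8x7B's roughly 46.7B total parameters as an approximation, the weight footprint at that bitrate works out like this:

```python
# Back-of-the-envelope size estimate for a 2.4-bit quantization of Mixtral-8x7B.
# The ~46.7B total-parameter count is an approximation.
total_params = 46.7e9
bits_per_param = 2.4

size_gb = total_params * bits_per_param / 8 / 1e9
print(f"~{size_gb:.1f} GB of weights")  # ~14.0 GB, before activations and cache overhead
```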