Denis Mazur

30 comments by Denis Mazur

Hey, @h9-tect, the notebook you pushed appears to be running out of memory. Is that still the case?

Hi! Sorry for the late reply. Running the model on multiple GPUs is not currently supported: all active experts are sent to cuda:0. You can send an expert to a...
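
As a generic illustration of the mechanism (the real experts in this repo sit behind the offloading cache, so the module and device names below are placeholders, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a single expert's projection layer; in the actual
# repo, experts are managed by the offloading cache rather than plain modules.
expert = nn.Linear(4096, 14336)

# Generic PyTorch device placement: move this one expert to a second GPU,
# falling back to CPU if only one GPU is available.
target = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"
expert = expert.to(target)
```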

By the way, one of our quantization setups compressed the model to 17 GB. That would fit into the VRAM of two T4 GPUs (16 GB each), which you can get for free on...

> May I ask which quantization setup allowed compression down to 17Gb, or if you could point me to a file that contains that setup please?

It's the 4-bit attention and...
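
Roughly, a mixed HQQ setup along those lines can be expressed with separate configs for the attention and expert layers. The bit widths and group sizes below are illustrative assumptions, not necessarily the exact recipe behind the 17 GB figure:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Illustrative mixed-precision setup (values are assumptions):
# keep attention layers at 4 bits, compress the expert MLPs harder.
attn_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
ffn_config  = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)
```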

> the model seems to only occupy ~11Gb on a single GPU without an OOM error, but then at inference there's no utilization of the GPU cores throughout (though the...

> Absolutely, what information are you looking for?

A stacktrace would be helpful.

Hi! Full fine-tuning won't work since the model is quantized, but you could try fine-tuning it with PEFT techniques that support quantized base models. Check out [QLoRA](https://github.com/artidoro/qlora)...
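
A rough sketch of the QLoRA-style recipe (the target module names and hyperparameters below are illustrative, and plain `peft` may need tweaks for this repo's custom offloaded layers):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical: stands in for the quantized/offloaded Mixtral built elsewhere;
# the checkpoint id here only illustrates the call shape.
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```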

Hey, @nmarafo and @complete-dope! It looks like using Hugging Face's `peft` to fine-tune the offloaded model is a bit tricky (mostly because of the custom layers), but I haven't looked into it...

I'm not sure whether `(module.meta['shape'][1], module.meta['shape'][0])` is the correct shape. Maybe you should try pulling the correct shape from the [original model's config](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).

```python
from transformers import AutoConfig

config = ...
```
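
If it helps, a hedged sketch of that lookup, assuming the shape in question is one of the expert MLP projections (whether you need `(intermediate_size, hidden_size)` or its transpose depends on which projection the module actually holds):

```python
from transformers import AutoConfig

# Read the reference dimensions straight from the official Mixtral config.
config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# The experts' w1/w3 projection weights are (intermediate_size, hidden_size)
# in the original checkpoint; w2 is the transpose. Compare against
# module.meta['shape'] instead of guessing by swapping its entries.
expected_shape = (config.intermediate_size, config.hidden_size)
print(expected_shape)  # (14336, 4096) for Mixtral-8x7B-v0.1
```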

Hey! We are currently looking into other quantization approaches, both to improve inference speed and LM quality. How good is exl2's 2.4-bit quantization? 2.4 bits per parameter sounds like it...
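
For a rough sense of scale, taking Mixtral-8x7B's roughly 46.7B total parameters as an approximation, the weight footprint at that bitrate works out like this:

```python
# Back-of-the-envelope size estimate for a 2.4-bit quantization of Mixtral-8x7B.
# The ~46.7B total-parameter count is an approximation.
total_params = 46.7e9
bits_per_param = 2.4

size_gb = total_params * bits_per_param / 8 / 1e9
print(f"~{size_gb:.1f} GB of weights")  # ~14.0 GB, before activations and cache overhead
```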