Casper
I quantized Llama 3 70B on 3x A6000 48GB. Did you adjust the calibration dataset?
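For reference, here is a minimal sketch of passing a custom calibration set in AutoAWQ. The `calib_data` keyword, the model path, and the quant config values are assumptions based on the usual `quantize()` signature, so check them against your installed version:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder path; point this at your local Llama 3 70B checkpoint
model_path = "meta-llama/Meta-Llama-3-70B"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# calib_data accepts raw text samples (or a dataset name); matching it to the
# domain the quantized model will serve can affect quantization quality
calib_data = [
    "Example text that resembles the data the quantized model will see.",
    "Another representative sample.",
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```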
Ahh, I see the issue. This is a transformers issue where they have a memory leak in their cache. If you look at examples/quantize.py, we pass use_cache=False...
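A hedged sketch of that loading pattern, mirroring examples/quantize.py as I understand it (exact kwargs may differ by version):

```python
from awq import AutoAWQForCausalLM

model_path = "meta-llama/Meta-Llama-3-70B"  # placeholder

# use_cache=False avoids the transformers KV-cache memory leak during the
# calibration forward passes; low_cpu_mem_usage reduces peak RAM while loading
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
)
```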
One last thing I noticed about your code that can cause OOM: you use `device_map='auto'`, which makes accelerate fill all GPUs with the model. It's better to set this...
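The recommendation above is truncated, so the following is only one plausible reading: keep the weights off the GPUs at load time and let the quantizer move layers over as needed. The `device_map` value here is an assumption, not the original advice:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    # Assumption: load onto CPU rather than letting accelerate spread the
    # full model across all GPUs; AWQ quantizes layer by layer, so the
    # whole model does not need to sit in VRAM at once
    device_map="cpu",
)
```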
By the way, this is a known issue: AWQ batches 128 samples through the model's forward pass at the same time. A fix is being worked on where...
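If your installed version already exposes a knob for this, the fix looks roughly like the sketch below. `n_parallel_calib_samples` is the name I believe later AutoAWQ releases use, but treat it as an assumption and check the signature of `quantize()` first:

```python
# Cap how many calibration samples go through a forward pass at once;
# smaller values trade quantization speed for lower peak VRAM.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    n_parallel_calib_samples=32,  # assumed kwarg; default batches all samples together
)
```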
Hi @ryanshrott, this is not implemented yet. PRs are welcome to enable this. I recommend installing from git until then.
I would love to add DBRX support. However, at the moment, I lack the hardware/$ to experiment enough to implement quantization support for this model because of the sheer size...
CC @younesbelkada. Not sure if this would break anything in the transformers integration. WDYT?
> As a sanity check I would run basic inference with transformers after merging this PR just to be sure, but looking at the PR it does not seem to...
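A basic sanity check like the one described could look like this, using the plain transformers API (the quantized-model path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "path/to/quantized-awq-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

# Generate a few tokens to confirm the quantized weights load and run
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```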
> The ppl improvement is really small; did you try other scores to see if this is worth it?

This is how it should have been implemented from the...
This is a vision model, and it would be nice to integrate it with LLaVA and others. Open to PRs that help integrate it!