Casper
I quantized Llama 3 70B on 3x A6000 48GB. Did you adjust the calibration dataset?
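For reference, here is a minimal sketch of passing a custom calibration set in AutoAWQ. The `calib_data` keyword, the model path, and the quant config values are assumptions based on the usual `quantize()` signature, so check them against your installed version:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder path; point this at your local Llama 3 70B checkpoint
model_path = "meta-llama/Meta-Llama-3-70B"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# calib_data accepts raw text samples (or a dataset name); matching it to the
# domain the quantized model will serve can affect quantization quality
calib_data = [
    "Example text that resembles the data the quantized model will see.",
    "Another representative sample.",
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```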
Ahh, I see the issue. This is a transformers issue where they have a memory leak in their cache. If you look at examples/quantize.py, we pass use_cache=False...
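A hedged sketch of that loading pattern, mirroring examples/quantize.py as I understand it (exact kwargs may differ by version):

```python
from awq import AutoAWQForCausalLM

model_path = "meta-llama/Meta-Llama-3-70B"  # placeholder

# use_cache=False avoids the transformers KV-cache memory leak during the
# calibration forward passes; low_cpu_mem_usage reduces peak RAM while loading
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
)
```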
One last thing I noticed about your code that can cause OOM: you use `device_map='auto'`, which makes accelerate fill all GPUs with the model. It's better to set this...
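The recommendation above is truncated, so the following is only one plausible reading: keep the weights off the GPUs at load time and let the quantizer move layers over as needed. The `device_map` value here is an assumption, not the original advice:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    # Assumption: load onto CPU rather than letting accelerate spread the
    # full model across all GPUs; AWQ quantizes layer by layer, so the
    # whole model does not need to sit in VRAM at once
    device_map="cpu",
)
```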
By the way, this is a known issue: AWQ batches 128 samples through the model's forward pass at the same time. A fix is being worked on where...
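If your installed version already exposes a knob for this, the fix looks roughly like the sketch below. `n_parallel_calib_samples` is the name I believe later AutoAWQ releases use, but treat it as an assumption and check the signature of `quantize()` first:

```python
# Cap how many calibration samples go through a forward pass at once;
# smaller values trade quantization speed for lower peak VRAM.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    n_parallel_calib_samples=32,  # assumed kwarg; default batches all samples together
)
```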
Hi @ryanshrott, this is not implemented yet. PRs are welcome to enable this. I recommend installing from git until then.
I would love to add DBRX support. However, at the moment, I lack the hardware/$ to experiment enough to implement quantization support for this model because of the sheer size...
CC @younesbelkada. Not sure if this would break anything in the transformers integration. WDYT?
> As a sanity check I would run basic inference with transformers after merging this PR just to be sure, but looking at the PR it does not seem to...
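A basic sanity check like the one described could look like this, using the plain transformers API (the quantized-model path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "path/to/quantized-awq-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

# Generate a few tokens to confirm the quantized weights load and run
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```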
> The ppl improvement is really small; did you try other scores to see if this is worth it?

This is how it should have been implemented from the...
This is a vision model, and it would be nice to integrate it with LLaVA and others. Open to PRs that help integrate it!