Casper
> RuntimeError: CUDA error: an illegal memory access was encountered

Looks like you might be running out of memory. Which GPU are you using to load the model? EDIT: If...
> I tried shifting to CUDA-11.8 but facing the same error. Any insights regarding this would be really appreciated. Thanks

I am not sure what your specific issue is. Can...
It looks like you are trying to modify `demo.py`, and I can't be sure exactly what is going on. I have been working on a refactoring of AWQ. Can...
> `demo.py` keeps the history, so every consecutive prompt adds to the context until it exceeds the maximum context length the model can support, and finally you will be able to see the error when...
I will investigate this in the future. You should be able to keep the same context length without problems; maybe it's just something to do with how the config is being...
I have now investigated what is happening. Hugging Face transformers/accelerate does not automatically load the maximum sequence length into the model, which causes some problems. I will aim to solve this in...
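To illustrate the issue described above, here is a minimal sketch (hypothetical helper, not the repo's actual code) of keeping an accumulating chat history under a model's maximum sequence length by dropping the oldest turns first; `count_tokens` is a crude stand-in for a real tokenizer:

```python
# Hypothetical sketch: cap a rolling chat history at a model's maximum
# sequence length by dropping the oldest turns first.

def count_tokens(text: str) -> int:
    # Crude whitespace approximation; in practice, use the model's
    # own tokenizer to count tokens.
    return len(text.split())

def trim_history(history: list[str], max_seq_len: int) -> list[str]:
    """Drop the oldest turns until the total token count fits."""
    trimmed = list(history)
    while trimmed and sum(count_tokens(t) for t in trimmed) > max_seq_len:
        trimmed.pop(0)
    return trimmed

history = ["hello there friend", "how are you today", "tell me a story"]
print(trim_history(history, max_seq_len=8))  # oldest turn is dropped
```

Without a cap like this, each consecutive prompt in `demo.py` grows the context until it exceeds the model's limit, which is when the error surfaces.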
> We have integrated this incredible work into our project [LMDeploy](https://github.com/InternLM/lmdeploy), which completes the LLM deployment toolkit, including compressing, inference, and serving.
>
> Additionally, by extensively optimizing the W4A16...
> [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) may be a good repo to look at. First of all, there should be an `AutoModelForCausalLM.from_quantized` method, similar to the `from_pretrained` method, to load the AWQ models from a...
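A rough sketch of what such a `from_quantized` entry point could look like, modeled on AutoGPTQ's loader pattern; the class name, argument names, and config fields here are illustrative assumptions, not the actual AWQ API:

```python
# Hypothetical sketch of a `from_quantized` classmethod in the style of
# AutoGPTQ's loader. All names here are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class QuantConfig:
    w_bit: int = 4          # weight bit-width
    group_size: int = 128   # quantization group size

class AutoAWQForCausalLM:
    def __init__(self, model_path: str, config: QuantConfig):
        self.model_path = model_path
        self.config = config

    @classmethod
    def from_quantized(cls, model_path: str, w_bit: int = 4,
                       group_size: int = 128) -> "AutoAWQForCausalLM":
        """Mirror `from_pretrained`: resolve the checkpoint, build the
        quantization config, and return a ready-to-use wrapper."""
        # A real implementation would locate and load the quantized
        # weights here instead of just storing the path.
        return cls(model_path, QuantConfig(w_bit=w_bit, group_size=group_size))

model = AutoAWQForCausalLM.from_quantized("org/model-awq", w_bit=4)
print(model.config.w_bit)  # → 4
```

The appeal of this pattern is that users who already know `from_pretrained` can load a quantized checkpoint with a single, familiar call.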
@abhinavkulkarni I have created a draft PR #72 that I have gotten pretty far with. I will likely need some people to test it as the code is semi-close to...
@benyang0506 I have an explanation to offer you. The benchmarks provided by this repository are correct, but they fail to mention the CPU used to measure the latency, which plays...