su-park
I encountered the same OOM error message, and I guess there is still no other solution..

- model: Llama-2-7b
- CUDA version: 12.2
- vLLM version: 0.3.0
- multi GPUs...
I resolved my case by setting `enforce_eager=True`, at the cost of slower generation. Thank you all.
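For reference, a minimal sketch of what worked for me, assuming the standard `vllm.LLM` constructor (the model id below is just a placeholder):

```python
from vllm import LLM, SamplingParams

# Falling back to eager mode skips CUDA graph capture, which was what
# pushed my setup over the memory limit; generation is slower but stable.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model id
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```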
I replaced `adapter_model.bin` with a checkpoint binary as @kuan-cresta mentioned; there has been some improvement, but the same issue persists. Do you have any more suggestions?
Hello. This seems to be a question related to the issue above, so I'm asking it here as well. We are currently running inference with the Mistral 7B model on a V100 16GB...