Loading Models that require execution of third party code (trust_remote_code=True)
I am trying to load MPT using the AsyncLLMEngine:
```python
engine_args = AsyncEngineArgs("mosaicml/mpt-7b-chat", engine_use_ray=True)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
But I am getting this error:
```
ValueError: Loading mosaicml/mpt-7b-chat-local requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
```
Is there any workaround for this, or would it be possible to add an option to trust remote code to EngineArgs?
Hi @nearmax-p, could you install vLLM from source? Then this error should disappear. Sorry for the inconvenience; we will update our PyPI package very soon.
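For reference, on a vLLM build that exposes the flag on the engine arguments (newer releases do), a minimal sketch looks like this; the `trust_remote_code` keyword on `AsyncEngineArgs` is assumed to be available in your version:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Only enable trust_remote_code after reviewing the custom code in the repo;
# MPT models need it because their config/modeling code lives in the repo.
engine_args = AsyncEngineArgs(
    model="mosaicml/mpt-7b-chat",
    trust_remote_code=True,  # assumed available in your vLLM version
    engine_use_ray=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```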
I see, thank you very much, this worked! One more issue I came across is that MPT-30B doesn't seem to load on 2 A100 GPUs.
I used the following command:
```python
engine_args = AsyncEngineArgs("mosaicml/mpt-30b-chat", engine_use_ray=True, tensor_parallel_size=2)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
And got the following response:
```
llm_engine.py:60] Initializing an LLM engine with config: model='mosaicml/mpt-30b-chat', tokenizer='mosaicml/mpt-30b-chat', tokenizer_mode=auto, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
```
But the model never loads properly and can't be called (I waited for 20+ minutes, and the model had already been downloaded from the Hugging Face Hub on my device). Have you encountered this before?
@nearmax-p thanks for reporting it. Could you share how large your CPU memory is? It seems such a bug occurs when the CPU memory is not enough. We haven't succeeded in reproducing the bug, so your information would be very helpful.
@WoosukKwon Sure! I am using an a2-highgpu-2g instance from GCP, so I have 170GB of CPU RAM. That actually seems like a lot to me.
@nearmax-p Then it's very weird. We've tested the model on exactly the same setup. Which type of disk are you using? And if possible, could you re-install vLLM and try again?
@WoosukKwon Interesting. I am using a 500GB balanced persistent disk, but I doubt that this makes a difference. I will try to reinstall and let you know what happens. Thanks for the quick responses, really appreciate it!
@nearmax-p Thanks! That would be very helpful.
Following up on the discussion: I ran into the same problem trying to load xgen-7b-8k-inst (I am not sure it is supported, but since it is based on LLaMA, I think it should be).
I have installed vLLM from source, as suggested, but when I run:
```python
llm = LLM(model="xgen-7b-8k-inst")
```
I get:
File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 669, in from_pretrained
raise ValueError(
ValueError: Loading /home/ec2-user/data/xgen-7b-8k-inst requires you to execute the tokenizer file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
Where should I set `trust_remote_code=True`?
Any feedback would be very welcome :)
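For anyone hitting the same tokenizer error, here is a minimal sketch, assuming your vLLM build forwards `trust_remote_code` from the `LLM` constructor to both the config and the tokenizer loader:

```python
from vllm import LLM, SamplingParams

# trust_remote_code allows the custom tokenizer/config code in the repo to run;
# review that code before enabling it.
llm = LLM(
    model="/home/ec2-user/data/xgen-7b-8k-inst",  # local path from the traceback above
    trust_remote_code=True,  # assumed available in your vLLM build
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```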
@WoosukKwon I tested my code after reinstalling vLLM (0.1.2); unfortunately, nothing has changed. Maybe I should have mentioned that I am working from an NVIDIA PyTorch Docker image. However, all other models run just fine.
@WoosukKwon now checking it outside of the container, will get back to you
@nearmax-p If you are using Docker, could you try increasing the shared memory size (e.g., to 64G)?
```
docker run --gpus all -it --rm --shm-size=64g nvcr.io/nvidia/pytorch:22.12-py3
```
@WoosukKwon Alright, it doesn't seem to be related to RAM, but to distributed serving. Outside of the container, I am facing the same problem, even with mpt-7b, when I use tensor_parallel_size=2. With tensor_parallel_size=1, it works.
I used the default packages that were installed with vLLM; I only uninstalled pydantic, but I'd assume that doesn't cause any issues.
@WoosukKwon Narrowed it down a bit. It is actually only a problem when using the AsyncLLMEngine.
```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid
import asyncio

engine_args = AsyncEngineArgs(model="openlm-research/open_llama_7b", engine_use_ray=True)
engine = AsyncLLMEngine.from_engine_args(engine_args)

sampling_params = SamplingParams(max_tokens=200, top_p=0.8)
request_id = random_uuid()
results_generator = engine.generate("Hello, my name is Max and I am the founder of", sampling_params, request_id)

async def stream_results():
    async for request_output in results_generator:
        text_outputs = [output.text for output in request_output.outputs]
        yield text_outputs

async def get_result():
    async for s in stream_results():
        print(s)

asyncio.run(get_result())
```
This script causes the issue. When writing an analogous script with the normal (non-async) LLMEngine, the issue didn't come up.
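For comparison, here is a minimal sketch of an equivalent synchronous script using the high-level `LLM` wrapper, with the same model and sampling settings as the async script above:

```python
from vllm import LLM, SamplingParams

# The synchronous engine runs in the current process without a Ray actor,
# which avoided the hang seen with engine_use_ray=True above.
llm = LLM(model="openlm-research/open_llama_7b")
sampling_params = SamplingParams(max_tokens=200, top_p=0.8)

outputs = llm.generate(["Hello, my name is Max and I am the founder of"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```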
Hi @nearmax-p, we faced a similar issue. As a quick fix, setting `engine_use_ray` to `False` worked for us.
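Concretely, a sketch of that quick fix, assuming the same arguments as the script above:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Keeping the engine in the current process (no Ray actor) avoided the hang for us.
engine_args = AsyncEngineArgs(
    model="openlm-research/open_llama_7b",
    engine_use_ray=False,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```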
Closing this issue as stale as there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open this issue.