Loading Models that require execution of third party code (trust_remote_code=True)
I am trying to load MPT using the AsyncLLMEngine:
```python
engine_args = AsyncEngineArgs("mosaicml/mpt-7b-chat", engine_use_ray=True)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
But I am getting this error:
```
ValueError: Loading mosaicml/mpt-7b-chat-local requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
```
Is there any workaround for this, or would it be possible to add an option to trust remote code to EngineArgs?
Hi @nearmax-p, could you install vLLM from source? Then this error should disappear. Sorry for the inconvenience; we will update our PyPI package very soon.
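For reference, on a vLLM build that exposes the flag on the engine arguments (newer releases do), a minimal sketch looks like this; the `trust_remote_code` keyword on `AsyncEngineArgs` is assumed to be available in your version:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Only enable trust_remote_code after reviewing the custom code in the repo;
# MPT models need it because their config/modeling code lives in the repo.
engine_args = AsyncEngineArgs(
    model="mosaicml/mpt-7b-chat",
    trust_remote_code=True,  # assumed available in your vLLM version
    engine_use_ray=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```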
I see, thank you very much, this worked! One more issue I came across is that MPT-30B doesn't seem to load on 2 A100 GPUs.
I used the following command:
```python
engine_args = AsyncEngineArgs("mosaicml/mpt-30b-chat", engine_use_ray=True, tensor_parallel_size=2)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```
And got the following response:
```
llm_engine.py:60] Initializing an LLM engine with config: model='mosaicml/mpt-30b-chat', tokenizer='mosaicml/mpt-30b-chat', tokenizer_mode=auto, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
```
But the model never loads properly and can't be called (I waited for 20+ minutes, and the model had already been downloaded from the Hugging Face Hub on my device). Have you encountered this before?
@nearmax-p thanks for reporting it. Could you share how large your CPU memory is? It seems such a bug occurs when the CPU memory is not enough. We haven't succeeded in reproducing the bug, so your information would be very helpful.
@WoosukKwon Sure! I am using an a2-highgpu-2g instance from GCP, so I have 170GB of CPU RAM. That actually seems like a lot to me.
@nearmax-p Then it's very weird. We've tested the model on exactly the same setup. Which type of disk are you using? And if possible, could you re-install vLLM and try again?
@WoosukKwon Interesting. I am using a 500GB balanced persistent disk, but I doubt that this makes a difference. I will try to reinstall and let you know what happens. Thanks for the quick responses, really appreciate it!
@nearmax-p Thanks! That would be very helpful.
Following up on the discussion: I ran into the same problem trying to load xgen-7b-8k-inst (I am not sure it is supported, but since it is based on LLaMA, I think it should be).
I have installed vLLM from source, as suggested, but when I run:
```python
llm = LLM(model="xgen-7b-8k-inst")
```
I get:
File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 669, in from_pretrained
raise ValueError(
ValueError: Loading /home/ec2-user/data/xgen-7b-8k-inst requires you to execute the tokenizer file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
Where should I set `trust_remote_code=True`?
Any feedback would be very welcome :)
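For anyone hitting the same tokenizer error, here is a minimal sketch, assuming your vLLM build forwards `trust_remote_code` from the `LLM` constructor to both the config and the tokenizer loader:

```python
from vllm import LLM, SamplingParams

# trust_remote_code allows the custom tokenizer/config code in the repo to run;
# review that code before enabling it.
llm = LLM(
    model="/home/ec2-user/data/xgen-7b-8k-inst",  # local path from the traceback above
    trust_remote_code=True,  # assumed available in your vLLM build
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```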
@WoosukKwon I tested my code after reinstalling vLLM (0.1.2); unfortunately, nothing has changed. Maybe I should have mentioned that I am working from an NVIDIA PyTorch Docker image. However, all other models run just fine.
@WoosukKwon now checking it outside of the container, will get back to you
@nearmax-p If you are using Docker, could you try increasing the shared memory size (e.g., to 64G)?
```
docker run --gpus all -it --rm --shm-size=64g nvcr.io/nvidia/pytorch:22.12-py3
```
@WoosukKwon Alright, it doesn't seem to be related to RAM, but to distributed serving. Outside of the container, I am facing the same problem, even with mpt-7b, when I use tensor_parallel_size=2. With tensor_parallel_size=1, it works.
I used the default packages that were installed with vLLM; I only uninstalled pydantic, but I'd assume that doesn't cause any issues.
@WoosukKwon Narrowed it down a bit. It is actually only a problem when using the AsyncLLMEngine.
```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid
import asyncio

engine_args = AsyncEngineArgs(model="openlm-research/open_llama_7b", engine_use_ray=True)
engine = AsyncLLMEngine.from_engine_args(engine_args)

sampling_params = SamplingParams(max_tokens=200, top_p=0.8)
request_id = random_uuid()
results_generator = engine.generate("Hello, my name is Max and I am the founder of", sampling_params, request_id)

async def stream_results():
    async for request_output in results_generator:
        text_outputs = [output.text for output in request_output.outputs]
        yield text_outputs

async def get_result():
    async for s in stream_results():
        print(s)

asyncio.run(get_result())
```
This script causes the issue. When writing an analogous script with the normal (non-async) LLMEngine, the issue didn't come up.
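For comparison, here is a minimal sketch of an equivalent synchronous script using the high-level `LLM` wrapper, with the same model and sampling settings as the async script above:

```python
from vllm import LLM, SamplingParams

# The synchronous engine runs in the current process without a Ray actor,
# which avoided the hang seen with engine_use_ray=True above.
llm = LLM(model="openlm-research/open_llama_7b")
sampling_params = SamplingParams(max_tokens=200, top_p=0.8)

outputs = llm.generate(["Hello, my name is Max and I am the founder of"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```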
Hi @nearmax-p, we faced a similar issue. As a quick fix, setting `engine_use_ray` to `False` worked for us.
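Concretely, a sketch of that quick fix, assuming the same arguments as the script above:

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Keeping the engine in the current process (no Ray actor) avoided the hang for us.
engine_args = AsyncEngineArgs(
    model="openlm-research/open_llama_7b",
    engine_use_ray=False,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```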
Closing this issue as stale as there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open this issue.