
Loading models from an S3 location instead of a local path

simon-mo opened this issue 1 year ago • 7 comments

Discussed in https://github.com/vllm-project/vllm/discussions/3072

Originally posted by petrosbaltzis on February 28, 2024:

Hello,

The vLLM library lets you load the model and tokenizer either from a local folder or directly from Hugging Face.

["python", "-m", "vllm.entrypoints.openai.api_server", \
"--host=0.0.0.0", \
"--port=8080", \
"--model=<local_path>", \
"--tokenizer=<local_path>",
]

I wonder if this functionality can be extended to support S3 locations, so that when we initialize the API server we can pass an S3 location instead:

["python", "-m", "vllm.entrypoints.openai.api_server", \
"--host=0.0.0.0", \
"--port=8080", \
"--model=<s3://bucket/prefix>", \
"--tokenizer=<s3://bucket/prefix>",
]

Petros

simon-mo avatar Feb 28 '24 18:02 simon-mo

Similar to what @ikalista mentioned in the original discussion, IMO a better way is to mount model storage into the container for model loading, unless we want to rewrite the model loader to directly "stream" from S3 to the GPU buffer like Anyscale did.
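
For illustration, a minimal sketch of this mounting approach, assuming AWS Mountpoint for S3 and a hypothetical bucket and prefix (s3fs or a CSI driver would work similarly):

# Mount the bucket into the container/host (hypothetical bucket and mount point)
mount-s3 my-model-bucket /mnt/models

# Then point vLLM at the mount as if it were a local path
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8080 \
  --model /mnt/models/my-model \
  --tokenizer /mnt/models/my-model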

ywang96 avatar Feb 29 '24 01:02 ywang96

Sorry to bump an old issue here, but does this mean that --download-dir does not load weights? The docs say "Directory to download and load the weights, default to the default cache dir of huggingface," which makes me think that when I specify --download-dir s3://my-bucket the bucket is used as a cache. But this issue makes me think my interpretation is incorrect?

drawnwren avatar Apr 25 '24 02:04 drawnwren

@ywang96 is anybody working on direct model loading? Do we have a benchmark comparing mounting with loading directly into memory? Happy to work on this if nobody else is.

ashvinnihalani avatar Sep 24 '24 00:09 ashvinnihalani

> @ywang96 is anybody working on direct model loading? Do we have a benchmark comparing mounting with loading directly into memory? Happy to work on this if nobody else is.

Not to my knowledge. Feel free to work on this, and thanks for your interest!

ywang96 avatar Sep 24 '24 01:09 ywang96

@ashvinnihalani are you still working on this? This would also be helpful for loading large models in environments where disk space isn't sufficient.

The issue with mounting object storage is that it requires the platform operator to provide it. For example, in certain K8s setups the user deploying vLLM may not have the permissions required to mount object storage in their container.

So that's why this would be a very valuable feature.

samos123 avatar Nov 07 '24 14:11 samos123

Hey, at RunAI we have published an open-source tool to stream model weights from an object store like S3 into GPU memory, called RunAI Model Streamer (https://github.com/run-ai/runai-model-streamer).

The Streamer gives two main advantages:

  1. Reading from storage with concurrency
  2. Integration with object storage such as S3

You can read further in the whitepaper: https://pages.run.ai/hubfs/PDFs/White%20Papers/Model-Streamer-Performance-Benchmarks.pdf

We have proposed a way to integrate it into vLLM: https://github.com/vllm-project/vllm/pull/10192
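
For reference, this is roughly how the integration from that PR is invoked from the vLLM CLI (a sketch only; the bucket path is a placeholder, and AWS credentials/region are taken from the environment as usual):

# Stream the weights directly from S3 instead of downloading them to disk first
vllm serve s3://my-bucket/my-model/ --load-format runai_streamer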

omer-dayan avatar Nov 17 '24 06:11 omer-dayan

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Feb 16 '25 02:02 github-actions[bot]

I have been trying to use the vllm serve command to stream a DeepSeek model from S3 and serve it with vLLM running in a container on an Inferentia2 node. However, the serve command always fails, as it seems to go down the path of looking for an HF model. Is anyone able to advise if my command is misconfigured? Thanks!

Command: vllm serve s3://mys3bucket/path-to-model-dir --load-format runai_streamer --device neuron --tensor-parallel-size 2 --max-num-seqs 4 --block-size 8 --use-v2-block-manager --max-model-len 2048

In the container logs:

Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 34, in cmd
    uvloop.run(run_server(args))
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 929, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 220, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 1119, in create_engine_config
    model_config = self.create_model_config()
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 1039, in create_model_config
    return ModelConfig(
  File "/workspace/vllm/vllm/config.py", line 304, in __init__
    hf_config = get_config(self.model, trust_remote_code, revision,
  File "/workspace/vllm/vllm/transformers_utils/config.py", line 257, in get_config
    raise ValueError(f"No supported config format found in {model}.")
ValueError: No supported config format found in /home/model-server/tmp/tmpz_tngsuy.

@omer-dayan have you seen anything like this before?

mikeengland avatar Feb 24 '25 14:02 mikeengland

Should this issue be closed? It looks like there are tests and it is documented, so what is left?

elatt avatar Apr 09 '25 18:04 elatt

I'm seeing a similar issue to what @mikeengland has. The RunAI streamer downloads the model from S3 successfully, but then vLLM cannot find a valid model in the downloaded location. It could be that the model is downloaded somewhere other than the advertised location in /tmp/..., because I didn't see that directory in the vLLM Docker container during or after the download.

vllm-1  | INFO 05-01 14:00:54 [cuda.py:220] Using Flash Attention backend on V1 engine.
vllm-1  | INFO 05-01 14:00:54 [gpu_model_runner.py:1174] Starting to load model /tmp/tmpvec_04ve...
vllm-1  | INFO 05-01 14:00:54 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors using Runai Model Streamer:   0% Completed | 0/4 [00:00<?, ?it/s]
vllm-1  | [RunAI Streamer] CPU Buffer size: 4.7 GiB for file: model-00001-of-00004.safetensors
vllm-1  | Read throughput is 386.16 MB per second
Loading safetensors using Runai Model Streamer:  25% Completed | 1/4 [00:14<00:43, 14.53s/it]
vllm-1  | [RunAI Streamer] CPU Buffer size: 4.6 GiB for file: model-00002-of-00004.safetensors
vllm-1  | Read throughput is 324.39 MB per second
Loading safetensors using Runai Model Streamer:  50% Completed | 2/4 [00:31<00:31, 15.90s/it]
vllm-1  | [RunAI Streamer] CPU Buffer size: 4.5 GiB for file: model-00003-of-00004.safetensors
vllm-1  | Read throughput is 319.38 MB per second
Loading safetensors using Runai Model Streamer:  75% Completed | 3/4 [00:47<00:16, 16.17s/it]
vllm-1  | [RunAI Streamer] CPU Buffer size: 1.5 GiB for file: model-00004-of-00004.safetensors
vllm-1  | Read throughput is 340.36 MB per second
vllm-1  | [RunAI Streamer] Overall time to stream 15.2 GiB of all files: 53.97s, 288.6 MiB/s
Loading safetensors using Runai Model Streamer: 100% Completed | 4/4 [00:53<00:00, 12.19s/it]
Loading safetensors using Runai Model Streamer: 100% Completed | 4/4 [00:53<00:00, 13.49s/it]
vllm-1  |
vllm-1  | INFO 05-01 14:01:49 [gpu_model_runner.py:1186] Model loading took 15.6095 GB and 54.719272 seconds
vllm-1  | INFO 05-01 14:02:00 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/1b68386e68/rank_0_0 for vLLM's torch.compile
vllm-1  | INFO 05-01 14:02:00 [backends.py:425] Dynamo bytecode transform time: 11.38 s
vllm-1  | INFO 05-01 14:02:03 [backends.py:132] Cache the graph of shape None for later use
vllm-1  | INFO 05-01 14:02:37 [backends.py:144] Compiling a graph for general shape takes 36.32 s
vllm-1  | INFO 05-01 14:02:56 [monitor.py:33] torch.compile takes 47.71 s in total
vllm-1  | INFO 05-01 14:02:57 [kv_cache_utils.py:566] GPU KV cache size: 287,792 tokens
vllm-1  | INFO 05-01 14:02:57 [kv_cache_utils.py:569] Maximum concurrency for 8,192 tokens per request: 35.13x
vllm-1  | INFO 05-01 14:03:31 [gpu_model_runner.py:1534] Graph capturing finished in 34 secs, took 0.93 GiB
vllm-1  | INFO 05-01 14:03:31 [core.py:151] init engine (profile, create kv cache, warmup model) took 102.44 seconds
vllm-1  | Traceback (most recent call last):
vllm-1  |   File "/usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py", line 424, in cached_files
vllm-1  |     hf_hub_download(
vllm-1  |   File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
vllm-1  |     validate_repo_id(arg_value)
vllm-1  |   File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
vllm-1  |     raise HFValidationError(
vllm-1  | huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/tmp/tmpvec_04ve'. Use `repo_type` argument if needed.

...

vllm-1  | OSError: Can't load the configuration of '/tmp/tmpvec_04ve'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/tmp/tmpvec_04ve' is the correct path to a directory containing a generation_config.json file

...

vllm-1  | ValueError: Invalid repository ID or local directory specified: '/tmp/tmpvec_04ve'.
vllm-1  | Please verify the following requirements:
vllm-1  | 1. Provide a valid Hugging Face repository ID.
vllm-1  | 2. Specify a local directory that contains a recognized configuration file.
vllm-1  |    - For Hugging Face models: ensure the presence of a 'config.json'.
vllm-1  |    - For Mistral models: ensure the presence of a 'params.json'.

alexlyzhov avatar May 01 '25 21:05 alexlyzhov

I have another issue where the default runai streamer simply terminates my vLLM container after a certain (relatively short) time period:

(VllmWorker rank=0 pid=783) Loading safetensors using Runai Model Streamer:   8% Completed | 9/118 [00:31<05:05,  2.80s/it]
(VllmWorker rank=0 pid=783) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
Read throughput is 1.31 GB per second
(VllmWorker rank=7 pid=790) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
Read throughput is 1.94 GB per second
(VllmWorker rank=3 pid=786) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
Read throughput is 1.84 GB per second
(VllmWorker rank=4 pid=787) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
Read throughput is 1.60 GB per second
Read throughput is 1.98 GB per second
(VllmWorker rank=5 pid=788) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
(VllmWorker rank=1 pid=784) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00010-of-00118.safetensors
Read throughput is 1.94 GB per second
(VllmWorker rank=0 pid=783) Loading safetensors using Runai Model Streamer:   8% Completed | 10/118 [00:33<04:40,  2.60s/it]
(VllmWorker rank=0 pid=783) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
Read throughput is 1.57 GB per second
Read throughput is 1.90 GB per second
Read throughput is 1.06 GB per second
(VllmWorker rank=2 pid=785) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
(VllmWorker rank=7 pid=790) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
(VllmWorker rank=6 pid=789) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
Read throughput is 2.12 GB per second
Read throughput is 1.70 GB per second
(VllmWorker rank=5 pid=788) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
(VllmWorker rank=3 pid=786) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
Read throughput is 1.85 GB per second
(VllmWorker rank=1 pid=784) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
Read throughput is 1.19 GB per second
(VllmWorker rank=4 pid=787) [RunAI Streamer] CPU Buffer size: 3.7 GiB for file: model-00011-of-00118.safetensors
stream closed EOF for vllm-servers/<model-name>-<uuid> (<pod-name>)

I have 500 GB of RAM on the Pod running that container in my K8s, so I doubt it's OOMing unless the streamer does something weird.
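
If it does turn out to be memory pressure from the streamer's CPU staging buffer, the vLLM docs for the RunAI streamer describe tuning it via --model-loader-extra-config. A rough sketch, assuming the concurrency and memory_limit keys from those docs (the values here are arbitrary examples):

# Cap the streamer's CPU buffer to ~5 GiB and lower the read concurrency (example values)
vllm serve s3://my-bucket/my-model/ \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 8, "memory_limit": 5368709120}'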

bredamatt avatar Jun 27 '25 11:06 bredamatt

> I have another issue where the default runai streamer simply terminates my vLLM container after a certain (relatively short) time period: [...]
>
> I have 500 GB of RAM on the Pod running that container in my K8s, so I doubt it's OOMing unless the streamer does something weird.

When I was testing, it seemed like the entire model got loaded into RAM before being copied into VRAM. IIRC I was hitting >700 GB of RAM usage loading Llama 4 Scout.

This was on AWS and the instance already had NVMe, so it ended up being faster to use s5cmd to pull the model to local storage (I was hitting about 6-8 GiB/s download from S3) and then use the default loader.
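
As a rough sketch of that workflow (bucket and paths are placeholders; assumes s5cmd is installed and the instance role can read the bucket):

# Pull the weights to local NVMe with many parallel connections, then load normally
s5cmd cp 's3://my-bucket/my-model/*' /nvme/my-model/
vllm serve /nvme/my-model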

nijave avatar Jun 27 '25 13:06 nijave

@nijave wow, that is quite hectic. This model is much larger (235B). I have my object store on NVMe in the same network, so it's technically already on disk; I just need to load it from that object store.

bredamatt avatar Jun 27 '25 17:06 bredamatt

@nijave in my case this was actually just a healthcheck that failed, and it turned out to work perfectly without sharding once the health and liveness checks were set up correctly.

bredamatt avatar Jun 30 '25 12:06 bredamatt

> @nijave in my case this was actually just a healthcheck that failed, and it turned out to work perfectly without sharding once the health and liveness checks were set up correctly.

Oh yeah, that's a good callout -- I forgot I had to add a long startupProbe in our k8s setup. We also use Istio, and I had to make the sidecar significantly bigger (like 32 CPU, 8 GB); that allowed pulling from S3 (we're on AWS) at 6-8 GiB/s versus 300 MiB/s. Something else to double-check if you have a mesh or proxies.

nijave avatar Jun 30 '25 16:06 nijave

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Sep 29 '25 02:09 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Oct 29 '25 02:10 github-actions[bot]