OpenGVLab/InternVL3_5-241B-A28B fails to load on 16× A100 40GB GPUs (OOM issue)
I am trying to load InternVL3_5-241B-A28B for inference on 16× A100 40GB GPUs, but I consistently run into an out-of-memory (OOM) error.
Here is the command I used:
lmdeploy serve api_server OpenGVLab/InternVL3_5-241B-A28B \
    --server-port 23333 \
    --tp 16 \
    --backend turbomind \
    --cache-max-entry-count 0.05 \
    --session-len 512 \
    --dtype bfloat16 \
    --quant-policy 4
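For reference, here is the back-of-envelope arithmetic I used to sanity-check the setup. The parameter count and the even-sharding assumption are my own estimates, not figures from the paper:

# Rough memory estimate (my own assumptions, not from the InternVL3.5 paper)
params = 241e9            # total parameters of InternVL3_5-241B-A28B
bytes_per_param = 2       # bfloat16
num_gpus = 16
gpu_mem_gb = 40

weights_gb = params * bytes_per_param / 1e9   # ~482 GB for all weights
per_gpu_gb = weights_gb / num_gpus            # ~30 GB, assuming even TP-16 sharding

print(f"total weights : {weights_gb:.0f} GB")
print(f"per-GPU shard : {per_gpu_gb:.1f} GB")
print(f"headroom/GPU  : {gpu_mem_gb - per_gpu_gb:.1f} GB")  # ~10 GB left

If that estimate is roughly right, each GPU is left with only about 10 GB for the KV cache, activations, the vision encoder, and CUDA/runtime buffers, which is why I already lowered --cache-max-entry-count and --session-len, yet the OOM persists.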
According to Table 18 of the InternVL3.5 paper, 8× A100 GPUs should be sufficient to run inference on InternVL3_5-241B-A28B.
I would like to understand whether there is an issue with my command that causes the OOM error, or whether this hardware configuration is simply insufficient.