Is it possible to run this model with ROCm?
I have a 7900 XTX, a ROCm-enabled card with 24GB of VRAM. I tried installing vLLM, but it just hit OOM, maybe because there was no flash-attention support.
I also tried Ollama with the GGUF version, but it just returns an empty response.
+1, I'd also like to run it on CPU or ROCm.
Any updates on this?
We released an FP8 version of olmOCR. You can try that out.
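If you want to pull the weights down first, a minimal sketch using huggingface-cli (the repo ID matches what later posts in this thread use; the target directory is just a placeholder):

huggingface-cli download allenai/olmOCR-7B-0725-FP8 --local-dir ./olmocr-fp8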
Yes, this runs on ROCm. I have tried a few experiments with it. I don't have official instructions to share, but you will need to install the ROCm build of vLLM and it should pretty much work.
I had problems installing on RDNA3 as well, since attention doesn't have proper support for my arch, but it's cool to know it's working. Closing the issue in this case.
Hi @jakep-allenai, Thanks for confirming that olmOCR runs on ROCm with the ROCm vLLM. I'm trying to set it up on AMD RX 7900 XTX with ROCm 6.2 on Ubuntu 22.04 in Docker, but I'm hitting persistent errors in vLLM's setup.py (e.g., AssertionError: CUDA_HOME is not set or TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' despite patches to bypass CUDA checks). My setup:
- Base image: rocm/dev-ubuntu-22.04:6.2
- PyTorch: 2.3.1 (from https://download.pytorch.org/whl/rocm6.2/)
- vLLM: built from source (tried v0.6.2, v0.6.1) with VLLM_BUILD_ROCM=1 and patches to setup.py (bypassing CUDA_HOME and get_nvcc_cuda_version)
- olmOCR: installed with [gpu] --no-deps to avoid the CUDA vLLM
- Environment variables: ROCM_HOME=/opt/rocm, HSA_OVERRIDE_GFX_VERSION=11.0.0, HIP_VISIBLE_DEVICES=0

rocm-smi detects the GPU, but torch.cuda.is_available() returns False with a 'No NVIDIA driver' error.
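For reference, the failing check is essentially the standard PyTorch one; torch.version.hip is populated on ROCm builds and None otherwise, so it is a quick way to confirm the ROCm wheel is the one actually installed:

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"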
Can you share your exact ROCm vLLM setup? Specifically:
- vLLM version or branch?
- PyTorch/ROCm versions?
- Any patches to vLLM's setup.py or olmOCR's check.py?
- Docker or host setup details?

Thanks for your help!
Have you tried this image? https://hub.docker.com/layers/rocm/vllm/rocm6.4.1_vllm_0.9.1_20250715/images/sha256-4a429705fa95a58f6d20aceab43b1b76fa769d57f32d5d28bd3f4e030e2a78ea
@haydn-jones Just merged in a change to allow an external vLLM server to be passed into olmocr via the --server flag. So, the ideal setup would be to launch an external vLLM with that image, serving the olmOCR model (be sure to set the served model name to olmocr), and try with that flag.
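A rough sketch of that setup, using the image linked above (or a newer tag) and borrowing the flags used later in this thread; paths and the PDF name are placeholders:

# 1) serve the olmOCR model from the ROCm vLLM image
docker run --rm --network=host \
  --device=/dev/kfd --device=/dev/dri \
  rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model allenai/olmOCR-7B-0725-FP8 \
    --served-model-name olmocr \
    --host 0.0.0.0 --port 8000

# 2) point the olmocr pipeline at that server
python -m olmocr.pipeline ./workspace \
  --server http://localhost:8000/v1 \
  --model olmocr --markdown --pdfs ./document.pdf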
Thank you, @jakep-allenai, for your initial guidance on setting up olmOCR with vLLM and the rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 Docker image—it was a great starting point and helped me get the model downloaded and partially loaded.
However, I've encountered persistent issues getting the vLLM server to start fully. Here's a detailed summary of my setup, steps tried, and errors faced. I'm hoping you can spot something I've missed, perhaps related to ROCm compatibility, Qwen2.5-VL specifics, or vLLM configuration for the FP8 model.
Setup
- OS: Ubuntu 24.04.3 LTS
- GPU: AMD RX 7900 XTX (24GB VRAM)
- ROCm: 6.4.3 (verified with rocminfo and rocm-smi; GPU detected as GFX1100, HSA enabled)
- Docker image: initially rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 (vLLM 0.9.1), upgraded to rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 (vLLM 0.10.0) for compatibility fixes
- Model: allenai/olmOCR-7B-0725-FP8 (downloaded via huggingface-cli to ~/OCR_ai/workspace/cache, ~10-15GB; also tried full-precision allenai/olmOCR-7B-0725)
- Project dir: ~/OCR_ai (with a virtual env for host-side tools like transformers 4.55.4 and PyTorch ROCm 6.4)
- Goal: run the vLLM server on port 8001 with the --server flag, test with curl http://localhost:8001/v1/models, and eventually process a sample PDF
Steps Tried
- Initial Setup (per your suggestion):
  - Used a Docker command with --network=host, devices (/dev/kfd, /dev/dri), groups (video, render), env vars (HSA_OVERRIDE_GFX_VERSION=11.0.0, HIP_VISIBLE_DEVICES=0, HF_TOKEN, VLLM_LOGGING_LEVEL=DEBUG, VLLM_USE_TRITON_FLASH_ATTN=0), volume mount -v $(pwd)/workspace:/workspace, and args like --model /workspace/cache, --trust-remote-code, --gpu-memory-utilization 0.75, --tensor-parallel-size 1 (see the sketch after this list).
  - Fixed syntax errors (invalid reference format, removed unsupported --download-timeout).
  - Manually downloaded the model to avoid slow HF downloads.
- Error Mitigation:
  - Resolved connection failures (curl "Couldn't connect to server") by adding --host 0.0.0.0.
  - Handled cache issues (weights in the container's /root/.cache vs. the host; set VLLM_CACHE_DIR=/workspace/cache).
  - Addressed processor warnings (slow image processor, deprecated preprocessor.json): renamed it to video_preprocessor.json, used an AutoProcessor.from_pretrained/save_pretrained script with transformers to set use_fast=True, added VLLM_IMAGE_PROCESSOR_USE_FAST=1.
  - Upgraded to vLLM 0.10.0 to fix TypeError: Qwen2_5_VLProcessor.__init__() got multiple values for argument 'image_processor' (successful; error gone).
  - Adjusted memory: --gpu-memory-utilization 0.8-0.9, added PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8, HSA_NO_SCRATCH_RECLAIM=1.
  - Tried both the local cache (--model /workspace/cache) and the model ID (--model allenai/olmOCR-7B-0725-FP8).
  - Verified GPU detection inside the container (torch.cuda.is_available() returns True).
- Fallbacks:
  - Tested the full-precision model (allenai/olmOCR-7B-0725); same errors.
  - Cleaned cache configs (backed up extras, kept preprocessor_config.json and video_preprocessor.json as required).
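For reference, the Docker invocation described in the first item above boils down to roughly this (reconstructed from the flags listed; a sketch, not the exact command):

docker run --rm --network=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 -e HIP_VISIBLE_DEVICES=0 \
  -e HF_TOKEN -e VLLM_LOGGING_LEVEL=DEBUG -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -v $(pwd)/workspace:/workspace \
  rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /workspace/cache --trust-remote-code \
    --host 0.0.0.0 --port 8001 \
    --gpu-memory-utilization 0.75 --tensor-parallel-size 1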
Persistent Errors
- Initial errors (resolved in vLLM 0.10.0): TypeError: Qwen2_5_VLProcessor.__init__(), slow processor warnings, deprecated preprocessor.json.
- Current error (in vLLM 0.10.0): RuntimeError: Engine core initialization failed during determine_available_memory and _initialize_kv_caches.
  - Key logs:
    - MIOpen(HIP): Error [EvaluateInvokers] Failed to launch kernel: invalid argument
    - UserWarning: Failed validator: GCN_ARCH_NAME (from PyTorch/aten/src/ATen/hip/tunable/Tunable.cpp)
    - Traceback points to vllm/v1/engine/core.py:164 and vllm/executor/abstract.py:76, failing in the collective RPC for device init/memory check.
  - The model loads partially (~9.04 GiB VRAM, ~914s), but fails at GPU memory detection/KV cache init.
  - No other critical warnings (e.g., SWA support, all-reduce kernel; expected for ROCm).
Let me know if full logs from the latest run are needed.
Questions/Request for Help
Have I missed something in the ROCm setup, vLLM args, or model-specific config for Qwen2.5-VL on RX 7900 XTX? For example:
- Is there a specific transformers or PyTorch version known to work with this model on ROCm?
- Any tweaks for HIP/MIOpen errors (e.g., env vars or kernel params)?
- Should I try a different Docker image or build vLLM from source for GFX1100?
Any advice would be greatly appreciated—thanks again for your help!
So, our ROCm setup is quite unusual: we don't have Docker support, only Apptainer, but this is the Apptainer config I used:
Bootstrap: docker
From: rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
%post
# ---- Configuration ----------------------------------------------------
# Adjust this to the custom Python version that the base image installs
PYTHON_VERSION=3.12
CUSTOM_PY="/usr/bin/python${PYTHON_VERSION}"
# ---- Workaround: temporarily restore distro Python --------------------
DIST_PY=$(ls /usr/bin/python3.[0-9]* | sort -V | head -n1)
# If a python alternative scheme already exists, remember its value so we
# can restore it later; otherwise, we will restore to CUSTOM_PY when we
# are done.
if update-alternatives --query python3 >/dev/null 2>&1; then
ORIGINAL_PY=$(update-alternatives --query python3 | awk -F": " '/Value:/ {print $2}')
else
ORIGINAL_PY=$CUSTOM_PY
fi
echo "Temporarily switching python3 alternative to ${DIST_PY} so that APT scripts use the distro‑built Python runtime."
update-alternatives --install /usr/bin/python3 python3 ${DIST_PY} 1
update-alternatives --set python3 ${DIST_PY}
update-alternatives --install /usr/bin/python python ${DIST_PY} 1
update-alternatives --set python ${DIST_PY}
# ---- APT operations that require the distro python3 -------------------
apt-get update -y
apt-get remove -y python3-blinker || true
# Pre‑seed the Microsoft Core Fonts EULA so the build is non‑interactive
echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
python3-apt \
update-notifier-common \
poppler-utils \
fonts-crosextra-caladea \
fonts-crosextra-carlito \
gsfonts \
lcdf-typetools \
ttf-mscorefonts-installer
# ---- Restore the original / custom Python alternative -----------------
echo "Restoring python3 alternative to ${ORIGINAL_PY}"
update-alternatives --install /usr/bin/python3 python3 ${ORIGINAL_PY} 1
update-alternatives --set python3 ${ORIGINAL_PY}
update-alternatives --install /usr/bin/python python ${ORIGINAL_PY} 1 || true
update-alternatives --set python ${ORIGINAL_PY} || true
# Ensure pip is available for the restored Python
curl -sS https://bootstrap.pypa.io/get-pip.py | ${ORIGINAL_PY}
# ---- Python‑level dependencies ---------------------------------------
cd /root
git clone https://github.com/allenai/olmocr
cd /root/olmocr
python3 -m pip install --no-cache-dir .[bench]
playwright install-deps
playwright install chromium
python3 -m olmocr.pipeline --help
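Building the image from that definition file looks roughly like this (the filenames are placeholders):

apptainer build olmocr_vllm_rocm.sif olmocr_vllm_rocm.def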
Later I run it like this:
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export HF_HUB_OFFLINE=1
export HF_DATASETS_CACHE="/lustre/orion/csc652/proj-shared/huggingface-shared/datasets"
export HF_HUB_CACHE="/lustre/orion/csc652/proj-shared/huggingface-shared/hub"
# Was getting MIOpen errors with caching, had to disable for now
export MIOPEN_DISABLE_CACHE=1
# Try without triton flash attention
export VLLM_USE_TRITON_FLASH_ATTN=0
export BENCH_PATH="/lustre/orion/csc652/proj-shared/jakep/olmOCR-bench"
apptainer exec .../olmcr_vllm_rocm.sif bash -c "python -m olmocr.pipeline .../olmocr_bench_workspace_$SLURM_JOB_ID --model allenai/olmOCR-7B-0825 --markdown --pdfs ${BENCH_PATH}/bench_data/pdfs/**/*.pdf"
I have found it necessary to use the non-FP8 version of the model. This is on MI250Xs, which are all I have access to right now.
I am using 2 x 7900 XTX and an RTX 4080 16G. vLLM only runs small LLMs, which are not of much use in real-world tasks; errors arose when running popular open-source models of 7B or larger. Very frustrated and disappointed.
I have managed to start it on a 7900 XT 20GB on Pop!_OS 22.04. For the XTX, gpu-memory-utilization should be lower.
Dockerfile:
FROM rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909
WORKDIR /workspace
# Assume you have cloned the repo locally first:
# git clone --depth 1 https://github.com/allenai/olmocr
COPY . ./olmocr
# Lower MAX_TOKENS since we run with a max-model-len smaller than 8192:
RUN sed -i 's/MAX_TOKENS = 8000/MAX_TOKENS = 800/g' ./olmocr/olmocr/pipeline.py
RUN apt-get update && \
apt-get install -y --no-install-recommends \
poppler-utils \
fonts-crosextra-caladea \
fonts-crosextra-carlito \
fonts-liberation \
fonts-dejavu-core \
gsfonts \
fontconfig && \
pip3 install --no-cache-dir ./olmocr img2pdf && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* /tmp/* /root/.cache
EXPOSE 8000
Compose:
version: '3.8'
services:
vllm:
image: olmocr:rocm
container_name: vllm-ocr
environment:
- HF_HUB_OFFLINE=1 # set to 0 for the first run
- HF_HUB_CACHE=/workspace/hf_cache/hub
- HSA_OVERRIDE_GFX_VERSION=11.0.0
- MIOPEN_USER_DB_PATH=/tmp/miopen-cache
- MIOPEN_CUSTOM_CACHE_DIR=/tmp/miopen-cache
- VLLM_USE_TRITON_FLASH_ATTN=0
- VLLM_LOGGING_LEVEL=DEBUG
volumes:
- ./input:/input
- ./output:/output
- ./hf_cache:/workspace/hf_cache
- ./miopen_cache:/tmp/miopen-cache
ports:
- "8000:8000"
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
ipc: host
command: >
python3 -m vllm.entrypoints.openai.api_server
--model allenai/olmOCR-2-7B-1025
--served-model-name olmocr
--gpu-memory-utilization 0.97
--max-model-len 2560
Usage:
docker build -t olmocr:rocm -f Dockerfile olmocr
docker-compose up -d  # the first run downloads the model; wait for it to complete
docker exec -it vllm-ocr python -m olmocr.pipeline /output/run_$(date +%s) --server http://localhost:8000/v1 --model olmocr --markdown --pdfs /input/input.pdf
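Once the container is up, a quick check that the server is ready before launching the pipeline (the same endpoint check mentioned earlier in the thread):

curl http://localhost:8000/v1/models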