
Is it possible to run this model with ROCm?

Open marcussacana opened this issue 9 months ago • 11 comments

I have a 7900 XTX, a ROCm-enabled card with 24GB of VRAM. I tried installing vLLM, but it just hit OOM, maybe because there was no flash-attention support.

I tried ollama with the GGUF version, but it just returns an empty response.

marcussacana avatar Mar 08 '25 19:03 marcussacana

+1 I'd also like to run it on CPU or ROCm.

grigio avatar Mar 09 '25 08:03 grigio

Any updates on this?

salsasteve avatar Jun 04 '25 03:06 salsasteve

We released an FP8 version of olmOCR. You can try that out.
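For example (a minimal sketch using the pipeline invocation shown later in this thread; the ./localworkspace and sample.pdf paths are placeholders):

# run the olmocr pipeline with the FP8 checkpoint; results land in ./localworkspace
python -m olmocr.pipeline ./localworkspace \
    --model allenai/olmOCR-7B-0725-FP8 \
    --markdown \
    --pdfs sample.pdf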

aman-17 avatar Jul 10 '25 21:07 aman-17

Yes, this runs on ROCm; I've tried a few experiments. I don't have official instructions to share, but you will need to install the ROCm vLLM, and it should pretty much work.

jakep-allenai avatar Jul 24 '25 18:07 jakep-allenai

> Yes, this runs on ROCm; I've tried a few experiments. I don't have official instructions to share, but you will need to install the ROCm vLLM, and it should pretty much work.

I had problems installing it on RDNA3 as well, since flash-attention doesn't have proper support for my arch, but it's good to know it works. Closing the issue in that case.

marcussacana avatar Jul 24 '25 21:07 marcussacana

Hi @jakep-allenai, thanks for confirming that olmOCR runs on ROCm with the ROCm vLLM. I'm trying to set it up on an AMD RX 7900 XTX with ROCm 6.2 on Ubuntu 22.04 in Docker, but I'm hitting persistent errors in vLLM's setup.py (e.g., AssertionError: CUDA_HOME is not set, or TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' despite patches to bypass the CUDA checks). My setup:

  • Base image: rocm/dev-ubuntu-22.04:6.2
  • PyTorch: 2.3.1 (from https://download.pytorch.org/whl/rocm6.2/)
  • vLLM: built from source (tried v0.6.2, v0.6.1) with VLLM_BUILD_ROCM=1 and patches to setup.py (bypassing CUDA_HOME and get_nvcc_cuda_version)
  • olmOCR: installed with [gpu] --no-deps to avoid the CUDA vLLM
  • Environment variables: ROCM_HOME=/opt/rocm, HSA_OVERRIDE_GFX_VERSION=11.0.0, HIP_VISIBLE_DEVICES=0
  • rocm-smi detects the GPU, but torch.cuda.is_available() returns False with a 'No NVIDIA driver' error.

Can you share your exact ROCm vLLM setup? Specifically:

  • vLLM version or branch?
  • PyTorch/ROCm versions?
  • Any patches to vLLM's setup.py or olmOCR's check.py?
  • Docker or host setup details?

Thanks for your help!
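For reference, the failing check is just the following one-liner, run inside the container (I also print torch.version.hip to confirm it is actually the ROCm build of PyTorch):

# should print a torch version, a non-None HIP version, and True on a working ROCm build
python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"

In my setup the last value is False, which matches the 'No NVIDIA driver' message above.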

ma526mac3K8 avatar Aug 24 '25 13:08 ma526mac3K8

Have you tried this image? https://hub.docker.com/layers/rocm/vllm/rocm6.4.1_vllm_0.9.1_20250715/images/sha256-4a429705fa95a58f6d20aceab43b1b76fa769d57f32d5d28bd3f4e030e2a78ea

@haydn-jones Just merged in a change to allow an external vLLM server to be passed into olmocr via the --server flag. So, the ideal setup would be to launch an external vLLM with that image, serving the olmOCR model (be sure to set the served model name to olmocr), and try with that flag.
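Roughly, that two-step setup would look like this (a sketch only: the image tag is the one linked above, the served model name and --server usage mirror what's shown later in this thread, and the port, workspace, and PDF paths are placeholders):

# 1) serve the olmOCR model from the ROCm vLLM image (OpenAI-compatible API on port 8000)
docker run --rm --network=host --ipc=host \
    --device /dev/kfd --device /dev/dri \
    rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 \
    python3 -m vllm.entrypoints.openai.api_server \
        --model allenai/olmOCR-7B-0725-FP8 \
        --served-model-name olmocr \
        --port 8000

# 2) point the olmocr pipeline at that external server
python -m olmocr.pipeline ./localworkspace \
    --server http://localhost:8000/v1 \
    --model olmocr \
    --markdown --pdfs sample.pdf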

jakep-allenai avatar Aug 25 '25 15:08 jakep-allenai

Thank you, @jakep-allenai, for your initial guidance on setting up olmOCR with vLLM and the rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 Docker image—it was a great starting point and helped me get the model downloaded and partially loaded.

However, I've encountered persistent issues getting the vLLM server to start fully. Here's a detailed summary of my setup, steps tried, and errors faced. I'm hoping you can spot something I've missed, perhaps related to ROCm compatibility, Qwen2.5-VL specifics, or vLLM configuration for the FP8 model.

Setup

  • OS: Ubuntu 24.04.3 LTS
  • GPU: AMD RX 7900 XTX (24GB VRAM)
  • ROCm: 6.4.3 (verified with rocminfo and rocm-smi—GPU detected as GFX1100, HSA enabled)
  • Docker image: Initially rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715 (vLLM 0.9.1), upgraded to rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 (vLLM 0.10.0) for compatibility fixes
  • Model: allenai/olmOCR-7B-0725-FP8 (downloaded via huggingface-cli to ~/OCR_ai/workspace/cache, ~10-15GB; also tried full-precision allenai/olmOCR-7B-0725)
  • Project dir: ~/OCR_ai (with virtual env for host-side tools like transformers 4.55.4 and PyTorch ROCm 6.4)
  • Goal: Run vLLM server on port 8001 with --server flag, test with curl http://localhost:8001/v1/models, and eventually process a sample PDF

Steps Tried

  1. Initial Setup (per your suggestion):

    • Used a Docker command with --network=host, devices (/dev/kfd, /dev/dri), groups (video, render), env vars (HSA_OVERRIDE_GFX_VERSION=11.0.0, HIP_VISIBLE_DEVICES=0, HF_TOKEN, VLLM_LOGGING_LEVEL=DEBUG, VLLM_USE_TRITON_FLASH_ATTN=0), a volume mount -v $(pwd)/workspace:/workspace, and args like --model /workspace/cache, --trust-remote-code, --gpu-memory-utilization 0.75, --tensor-parallel-size 1 (an approximate reconstruction of the full command is sketched after this list).
    • Fixed syntax errors (invalid reference format, removed unsupported --download-timeout).
    • Manually downloaded model to avoid slow HF downloads.
  2. Error Mitigation:

    • Resolved connection failures (curl "Couldn't connect to server") by adding --host 0.0.0.0.
    • Handled cache issues (weights in container's /root/.cache vs. host; set VLLM_CACHE_DIR=/workspace/cache).
    • Addressed processor warnings (slow image processor, deprecated preprocessor.json): Renamed to video_preprocessor.json, used AutoProcessor.from_pretrained and save_pretrained script with transformers to set use_fast=True, added VLLM_IMAGE_PROCESSOR_USE_FAST=1.
    • Upgraded to vLLM 0.10.0 to fix TypeError: Qwen2_5_VLProcessor.__init__() got multiple values for argument 'image_processor' (successful—error gone).
    • Adjusted memory: --gpu-memory-utilization 0.8-0.9, added PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8, HSA_NO_SCRATCH_RECLAIM=1.
    • Tried both local cache (--model /workspace/cache) and model ID (--model allenai/olmOCR-7B-0725-FP8).
    • Verified GPU detection inside container (torch.cuda.is_available() returns True).
  3. Fallbacks:

    • Tested full-precision model (allenai/olmOCR-7B-0725)—same errors.
    • Cleaned cache configs (backed up extras, kept preprocessor_config.json and video_preprocessor.json as required).
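For concreteness, the serving command from step 1, with the step 2 adjustments folded in, looked roughly like this (an approximate reconstruction from my notes, not an exact transcript):

docker run --rm --network=host \
    --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
    -e HIP_VISIBLE_DEVICES=0 \
    -e HF_TOKEN=$HF_TOKEN \
    -e VLLM_LOGGING_LEVEL=DEBUG \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_CACHE_DIR=/workspace/cache \
    -e PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8 \
    -e HSA_NO_SCRATCH_RECLAIM=1 \
    -v $(pwd)/workspace:/workspace \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 \
    python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 --port 8001 \
        --model /workspace/cache \
        --trust-remote-code \
        --gpu-memory-utilization 0.9 \
        --tensor-parallel-size 1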

Persistent Errors

  • Initial Errors (Resolved in vLLM 0.10.0): TypeError: Qwen2_5_VLProcessor.__init__(), slow processor warnings, deprecated preprocessor.json.
  • Current Error (in vLLM 0.10.0): RuntimeError: Engine core initialization failed during determine_available_memory and _initialize_kv_caches.
    • Key logs:
      • MIOpen(HIP): Error [EvaluateInvokers] Failed to launch kernel: invalid argument
      • UserWarning: Failed validator: GCN_ARCH_NAME (from PyTorch/aten/src/ATen/hip/tunable/Tunable.cpp)
      • Traceback points to vllm/v1/engine/core.py:164, vllm/executor/abstract.py:76, failing in collective RPC for device init/memory check.
    • Model loads partially (~9.04 GiB VRAM, ~914s), but fails at GPU memory detection/KV cache init.
    • No other critical warnings (e.g., SWA support, all-reduce kernel—expected for ROCm).

Let me know if full logs from the latest run are needed.

Questions/Request for Help

Have I missed something in the ROCm setup, vLLM args, or model-specific config for Qwen2.5-VL on RX 7900 XTX? For example:

  • Is there a specific transformers or PyTorch version known to work with this model on ROCm?
  • Any tweaks for HIP/MIOpen errors (e.g., env vars or kernel params)?
  • Should I try a different Docker image or build vLLM from source for GFX1100?

Any advice would be greatly appreciated—thanks again for your help!


ma526mac3K8 avatar Aug 27 '25 07:08 ma526mac3K8

So, our ROCm setup is very unusual: we don't have Docker support, only Apptainer, but this is the Apptainer config I used:

# Apptainer definition file (the Bootstrap header is required when pulling from Docker Hub)
Bootstrap: docker
From: rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812

%post
    # ---- Configuration ----------------------------------------------------
    # Adjust this to the custom Python version that the base image installs
    PYTHON_VERSION=3.12
    CUSTOM_PY="/usr/bin/python${PYTHON_VERSION}"

    # ---- Workaround: temporarily restore distro Python --------------------
    DIST_PY=$(ls /usr/bin/python3.[0-9]* | sort -V | head -n1)

    # If a python alternative scheme already exists, remember its value so we
    # can restore it later; otherwise, we will restore to CUSTOM_PY when we
    # are done.
    if update-alternatives --query python3 >/dev/null 2>&1; then
        ORIGINAL_PY=$(update-alternatives --query python3 | awk -F": " '/Value:/ {print $2}')
    else
        ORIGINAL_PY=$CUSTOM_PY
    fi

    echo "Temporarily switching python3 alternative to ${DIST_PY} so that APT scripts use the distro‑built Python runtime."
    update-alternatives --install /usr/bin/python3 python3 ${DIST_PY} 1
    update-alternatives --set python3 ${DIST_PY}
    update-alternatives --install /usr/bin/python python ${DIST_PY} 1
    update-alternatives --set python ${DIST_PY}

    # ---- APT operations that require the distro python3 -------------------
    apt-get update -y
    apt-get remove -y python3-blinker || true

    # Pre‑seed the Microsoft Core Fonts EULA so the build is non‑interactive
    echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections

    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3-apt \
        update-notifier-common \
        poppler-utils \
        fonts-crosextra-caladea \
        fonts-crosextra-carlito \
        gsfonts \
        lcdf-typetools \
        ttf-mscorefonts-installer

    # ---- Restore the original / custom Python alternative -----------------
    echo "Restoring python3 alternative to ${ORIGINAL_PY}"
    update-alternatives --install /usr/bin/python3 python3 ${ORIGINAL_PY} 1
    update-alternatives --set python3 ${ORIGINAL_PY}
    update-alternatives --install /usr/bin/python python ${ORIGINAL_PY} 1 || true
    update-alternatives --set python ${ORIGINAL_PY} || true

    # Ensure pip is available for the restored Python
    curl -sS https://bootstrap.pypa.io/get-pip.py | ${ORIGINAL_PY}

    # ---- Python‑level dependencies ---------------------------------------
    cd /root
    git clone https://github.com/allenai/olmocr
    cd /root/olmocr

    python3 -m pip install --no-cache-dir .[bench]
    playwright install-deps
    playwright install chromium

    python3 -m olmocr.pipeline --help

Later I run it like this:

export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export HF_HUB_OFFLINE=1

export HF_DATASETS_CACHE="/lustre/orion/csc652/proj-shared/huggingface-shared/datasets"
export HF_HUB_CACHE="/lustre/orion/csc652/proj-shared/huggingface-shared/hub"

# Was getting MIOpen errors with caching, had to disable for now
export MIOPEN_DISABLE_CACHE=1

# Try without triton flash attention
export VLLM_USE_TRITON_FLASH_ATTN=0

export BENCH_PATH="/lustre/orion/csc652/proj-shared/jakep/olmOCR-bench"

apptainer exec .../olmcr_vllm_rocm.sif bash -c "python -m olmocr.pipeline .../olmocr_bench_workspace_$SLURM_JOB_ID --model allenai/olmOCR-7B-0825 --markdown --pdfs ${BENCH_PATH}/bench_data/pdfs/**/*.pdf"

I have found it necessary to use the non-FP8 version of the model. This is on MI250Xs, which are all I have access to right now.

jakep-allenai avatar Sep 10 '25 19:09 jakep-allenai

I am using 2 x 7900 XTX and an RTX 4080 16GB. vLLM only runs small LLMs, which are not of much use in real-world tasks; errors arise when running popular open-source models of 7B or larger. Very frustrated and disappointed.

wcwong22000 avatar Oct 07 '25 13:10 wcwong22000

I have managed to start it on a 7900 XT 20GB on Pop!_OS 22.04. For an XTX, gpu-memory-utilization should be lower.

Dockerfile:

FROM rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909

WORKDIR /workspace

# Assume you have cloned the repo locally first:
# git clone --depth 1 https://github.com/allenai/olmocr
COPY . ./olmocr

# Lower MAX_TOKENS so the pipeline fits within a max-model-len smaller than the default 8192:
RUN sed -i 's/MAX_TOKENS = 8000/MAX_TOKENS = 800/g' ./olmocr/olmocr/pipeline.py

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        poppler-utils \
        fonts-crosextra-caladea \
        fonts-crosextra-carlito \
        fonts-liberation \
        fonts-dejavu-core \
        gsfonts \
        fontconfig && \
    pip3 install --no-cache-dir ./olmocr img2pdf && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /root/.cache

EXPOSE 8000

Compose:

version: '3.8'

services:

  vllm:
    image: olmocr:rocm
    container_name: vllm-ocr
    environment:
      - HF_HUB_OFFLINE=1 # set to 0 for the first run
      - HF_HUB_CACHE=/workspace/hf_cache/hub
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
      - MIOPEN_USER_DB_PATH=/tmp/miopen-cache
      - MIOPEN_CUSTOM_CACHE_DIR=/tmp/miopen-cache
      - VLLM_USE_TRITON_FLASH_ATTN=0
      - VLLM_LOGGING_LEVEL=DEBUG
    volumes:
      - ./input:/input
      - ./output:/output
      - ./hf_cache:/workspace/hf_cache
      - ./miopen_cache:/tmp/miopen-cache
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    ipc: host
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model allenai/olmOCR-2-7B-1025
      --served-model-name olmocr
      --gpu-memory-utilization 0.97
      --max-model-len 2560

Usage:

docker build -t olmocr:rocm -f Dockerfile olmocr

docker-compose up -d # the first run will download the model; wait for it to complete

docker exec -it vllm-ocr python -m olmocr.pipeline /output/run_$(date +%s) --server http://localhost:8000/v1 --model olmocr --markdown --pdfs /input/input.pdf

imsgit avatar Nov 17 '25 17:11 imsgit