
[BUG] CUDA Error: Invalid Device Ordinal on Single GPU Setup (NVIDIA RTX 3080)

[Open] ZiweiSong96 opened this issue 3 months ago • 0 comments

Prerequisites

  • [x] I have read the MoE-Infinity documentation.
  • [x] I have searched the Issue Tracker to ensure this hasn't been reported before.

System Information

GPU: NVIDIA GeForce RTX 3080

NVIDIA Driver Version: 560.35.03

CUDA Toolkit Version (from nvcc -V): 12.1

PyTorch Version: torch 2.5.1+cu121

Python Version: 3.9

Installation Method: Built moe-infinity from source in a clean conda environment, as suggested in the README.
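For completeness, the versions above can be confirmed from inside the conda environment with a quick check (standard torch calls only; expected values shown as comments):

import torch

print(torch.__version__)              # 2.5.1+cu121
print(torch.version.cuda)             # 12.1
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 3080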

Problem Description

I consistently encounter a CUDA error: invalid device ordinal when loading a Mixtral model on a single-GPU system, even with a matched environment (PyTorch built for CUDA 12.1 and system CUDA Toolkit 12.1). The error appears to originate from the low-level Archer C++ backend during model initialization. Standard debugging steps such as setting CUDA_VISIBLE_DEVICES=0 do not resolve the issue. I am using a locally cached Mixtral checkpoint, the base model of the "mixtral-offloading" demo (Mixtral-8x7B-Instruct-v0.1-offloading-demo).
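For context, device visibility can be sanity-checked with a few standard torch calls (a sketch, separate from the failing script below):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before any CUDA initialization

import torch

print(torch.cuda.device_count())    # expected: 1 on this machine
print(torch.cuda.current_device())  # expected: 0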

The bug log is as follows:

/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
/home/mlabszw/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
āœ… Using checkpoint: /home/mlabszw/2025_paper_1/AdapMoE/Mixtral-8x7B-Instruct-v0.1-offloading-demo
āœ… Using cache path: /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching

šŸ”„ Loading tokenizer...
Tokenizer loaded successfully.

šŸ”„ Loading model with moe-infinity engine...
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /home/mlabszw/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /home/mlabszw/.cache/torch_extensions/py39_cu121/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.34490704536438 seconds
[20251008 22:24:55.852284Z ][540102 ][INFO ]Create ArcherAioThread for thread: 0 - archer_aio_thread.cpp:12
[20251008 22:24:55.852391Z ][540102 ][INFO ]Index file /home/mlabszw/2025_paper_2/2026_rtas/moe_infinity_baseline/model_caching/archer_index does not exist, creating - archer_tensor_handle.cpp:48
[20251008 22:24:55.852395Z ][540102 ][INFO ]Index file size 0 - archer_tensor_handle.cpp:50
[20251008 22:24:55.852507Z ][540102 ][INFO ]Device count 1 - archer_prefetch_handle.cpp:40
[20251008 22:24:55.852511Z ][540102 ][INFO ]Enabled peer access for all devices - archer_prefetch_handle.cpp:63
Creating model from scratch ...
Loading checkpoint files:   0%|          | 0/257 [00:00<?, ?it/s]
āŒ Error during model loading: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

šŸ’” Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.
ArcherTaskPool destructor
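As the error message itself suggests, running with CUDA_LAUNCH_BLOCKING=1 should make the failing call appear at the correct place in the stack trace. A possible debugging variant of the script's preamble (sketch only, output not included here):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # surface CUDA errors at the offending call

import torch  # import torch only after the environment variables are set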

Steps to Reproduce

Create a clean conda environment with Python 3.9.

Install PyTorch for CUDA 12.1: pip install torch --index-url https://download.pytorch.org/whl/cu121

Install dependencies: pip install transformers accelerate sentencepiece

Clone the MoE-Infinity repository and build it from source with pip install . (run from the repository root).

Run the following Python script:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer
from moe_infinity import MoE

def run_mixtral_inference():
    """Load a Mixtral model with moe-infinity and run a single inference."""
    checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

    cache_path = "model_caching"
    os.makedirs(cache_path, exist_ok=True)

    print(f"āœ… Using checkpoint: {checkpoint}")
    print(f"āœ… Using cache path: {cache_path}")

    print("\nšŸ”„ Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    print("Tokenizer loaded successfully.")

    moe_config = {
        "offload_path": cache_path,
        "device_memory_ratio": 0.25,
    }

    print("\nšŸ”„ Loading model with moe-infinity engine...")

    try:
        model = MoE(checkpoint, moe_config)
        print("āœ… Model loaded successfully onto device:", model.model.device)
    except Exception as e:
        print(f"āŒ Error during model loading: {e}")
        print("\nšŸ’” Tip: If you encountered a CUDA error, ensure your drivers and PyTorch installation are compatible.")
        return

    messages = [
        {"role": "user", "content": "What are the main challenges in developing Mixture-of-Experts models?"},
    ]

    print("\nšŸ”„ Preparing inputs with chat template...")

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,    # return input_ids and attention_mask so **inputs works below
        return_tensors="pt",
    ).to(model.model.device)

    print("šŸš€ Generating response...")
    with torch.no_grad():  # disable gradient calculation for inference
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.7)
    print("Generation complete.")

    # Decode only the newly generated tokens (everything after the prompt).
    response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

    print("\n" + "="*50)
    print("šŸ’¬ Model Response:")
    print("="*50)
    print(response_text)
    print("="*50)


if __name__ == "__main__":
    run_mixtral_inference()
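If it helps triage, a reduced script that stops right after the MoE constructor (the point where the error shows up in the log above) exercises only the failing path. This is a sketch using the same checkpoint and config as the full script:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from moe_infinity import MoE

checkpoint = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
moe_config = {"offload_path": "model_caching", "device_memory_ratio": 0.25}

# The "CUDA error: invalid device ordinal" is raised somewhere inside this call.
model = MoE(checkpoint, moe_config)
print("Loaded on:", model.model.device)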

Expected Behavior

No response

Additional Context

No response

Usage Statistics (Optional)

No response
