
Prebuilt kernels not found, using JIT backend

Open rangehow opened this issue 9 months ago • 14 comments

Under what circumstances will this prompt appear? Since my environment is offline, I downloaded the .whl file from the release and installed it successfully on the server using pip, with no errors during the entire process. Should I be concerned about this prompt, and does it indicate a potential decrease in inference performance?

rangehow avatar Feb 19 '25 08:02 rangehow

Hi @rangehow, that indicates you are using the JIT version instead of the AOT version.

I downloaded the .whl file from the release and installed it successfully on the server using pip

What's the URL of the wheel file you downloaded?

For the JIT version, if you properly warm up the kernel compilation, it will not cause a performance decrease: we cache JIT-compiled kernels.

yzh119 avatar Feb 19 '25 16:02 yzh119

Thanks for your explanation. I downloaded the wheel from a release in this repository: https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.1.post2/flashinfer_python-0.2.1.post2+cu121torch2.5-cp38-abi3-linux_x86_64.whl Is there some way to download a .whl for the AOT version?

rangehow avatar Feb 20 '25 02:02 rangehow

Hi, those wheels are already the AOT version; the JIT version includes only source code, not binary files (only 1.7 MB on PyPI).

I guess you might have installed multiple versions of flashinfer. You can try

import flashinfer
print(flashinfer.__version__)

and check whether it matches with the wheel you installed.
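One way to make that check more robust is to also print where the module would be imported from, which exposes shadowed or duplicate installs. A stdlib-only sketch, where `describe` is a hypothetical helper; the stdlib `json` module and a deliberately nonexistent distribution name serve only as a runnable demo, and for flashinfer you would pass `("flashinfer", "flashinfer-python")`:

```python
# Hypothetical diagnostic helper: report the file a module would be imported
# from and the version pip has recorded for a distribution. A mismatch, or a
# path outside your active environment, suggests duplicate installs.
import importlib.util
from importlib.metadata import PackageNotFoundError, version

def describe(module_name: str, dist_name: str):
    spec = importlib.util.find_spec(module_name)
    origin = spec.origin if spec else None
    try:
        dist_version = version(dist_name)
    except PackageNotFoundError:
        dist_version = None
    return origin, dist_version

# Demo on stdlib `json`; for real use: describe("flashinfer", "flashinfer-python")
origin, ver = describe("json", "no-such-distribution-demo")
print(origin, ver)
```

If the reported origin lives in a different environment than the wheel you installed, the JIT fallback message is explained by the wrong copy being imported.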

yzh119 avatar Feb 23 '25 22:02 yzh119

2025-02-25 08:37:29,910 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-25 08:37:30 __init__.py:207] Automatically detected platform cuda.
vLLM ver:0.7.3
FlashInfer ver:0.2.1.post2+cu121torch2.5

kgboyko avatar Feb 25 '25 08:02 kgboyko

Hi I'm also seeing the same issue from a fresh install (python 3.10, CUDA 12.2, using the cu121 wheel). Is this expected?

~/ > mkdir import-flashinfer
~/ > cd import-flashinfer
~/import-flashinfer/ > python3 -m venv .venv
~/import-flashinfer/ > source .venv/bin/activate
(.venv) ~/import-flashinfer/ > pip install flashinfer-python --find-links https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python
Looking in links: https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python
Collecting flashinfer-python
  Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1%2Bcu121torch2.5-cp38-abi3-linux_x86_64.whl (527.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.1/527.1 MB 7.4 MB/s eta 0:00:00
Collecting torch==2.5.*
  Using cached torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl (906.4 MB)
Collecting nvidia-cublas-cu12==12.4.5.8
  Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
Collecting jinja2
  Using cached jinja2-3.1.6-py3-none-any.whl (134 kB)
Collecting filelock
  Using cached filelock-3.17.0-py3-none-any.whl (16 kB)
Collecting sympy==1.13.1
  Using cached sympy-1.13.1-py3-none-any.whl (6.2 MB)
Collecting nvidia-curand-cu12==10.3.5.147
  Using cached nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)
Collecting nvidia-cuda-cupti-cu12==12.4.127
  Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
Collecting nvidia-nvjitlink-cu12==12.4.127
  Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
Collecting fsspec
  Using cached fsspec-2025.2.0-py3-none-any.whl (184 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
Collecting triton==3.1.0
  Using cached triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)
Collecting nvidia-nccl-cu12==2.21.5
  Using cached nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)
Collecting nvidia-cusolver-cu12==11.6.1.9
  Using cached nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)
Collecting nvidia-cusparse-cu12==12.3.1.170
  Using cached nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127
  Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
Collecting nvidia-cuda-runtime-cu12==12.4.127
  Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
Collecting nvidia-nvtx-cu12==12.4.127
  Using cached nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)
Collecting networkx
  Using cached networkx-3.4.2-py3-none-any.whl (1.7 MB)
Collecting nvidia-cufft-cu12==11.2.1.3
  Using cached nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)
Collecting typing-extensions>=4.8.0
  Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting mpmath<1.4,>=1.1.0
  Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Collecting MarkupSafe>=2.0
  Using cached MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20 kB)
Installing collected packages: mpmath, typing-extensions, sympy, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, flashinfer-python
Successfully installed MarkupSafe-3.0.2 filelock-3.17.0 flashinfer-python-0.2.2.post1+cu121torch2.5 fsspec-2025.2.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 sympy-1.13.1 torch-2.5.1 triton-3.1.0 typing-extensions-4.12.2

[notice] A new release of pip is available: 23.0.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
(.venv) ~/import-flashinfer/ > python
Python 3.10.14 (main, Jan 18 2025, 03:01:18) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import flashinfer
/home/user/import-flashinfer/.venv/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
2025-03-05 20:19:35,478 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
>>> flashinfer.__version__
'0.2.2.post1+cu121torch2.5'

bshimanuki avatar Mar 05 '25 20:03 bshimanuki

Hi @yzh119, it seems this problem needs more attention, since a lot of people are also encountering it.

The reply you provided suggests a method for checking, but the results always show no issues. Furthermore, even when the check does reveal a problem, there is no viable fix. The current issue seems to be that the version installed through the AOT wheel ultimately still behaves as the JIT version.

rangehow avatar Mar 06 '25 05:03 rangehow

same problem

>>> import flashinfer
2025-03-14 15:07:23,015 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
>>> print(flashinfer.__version__)
0.2.3+cu121torch2.5

Dada-Cloudzxy avatar Mar 14 '25 07:03 Dada-Cloudzxy

Same problem here. My container takes a lot of time for compiling the kernels when startup.

>>> import torch
>>> torch.__version__
'2.4.0+cu121'
>>> import flashinfer
2025-03-20 03:09:00,820 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
>>> flashinfer.__version__
'0.2.3+cu121torch2.4'

the root cause is probably this: https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/jit/__init__.py#L67

>>> import flashinfer.flashinfer_kernels_sm90  
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer.flashinfer_kernels_sm90'

flashinfer_kernels_sm90.abi3.so is missing in https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.3/flashinfer_python-0.2.3+cu121torch2.4-cp38-abi3-linux_x86_64.whl
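A quick way to verify whether the AOT binaries actually shipped is to list the compiled extension files inside the installed package directory. A stdlib-only sketch, where `shipped_extensions` is a hypothetical helper; the pure-Python stdlib `json` package is used as a runnable demo, and for flashinfer you would call `shipped_extensions("flashinfer")` and look for files like `flashinfer_kernels_sm90.abi3.so`:

```python
# Hypothetical check: list compiled extension modules (*.so, which includes
# *.abi3.so) that were actually installed alongside a package.
import importlib.util
from pathlib import Path

def shipped_extensions(module_name: str):
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return []
    # spec.origin is the package's __init__ file; its parent is the package dir.
    return sorted(p.name for p in Path(spec.origin).parent.glob("*.so"))

# Demo on the pure-Python stdlib `json` package (no extensions expected);
# for real use: shipped_extensions("flashinfer")
print(shipped_extensions("json"))  # prints []
```

An empty list for flashinfer would confirm the wheel carried no prebuilt kernels and the JIT fallback is expected.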

yawnzh avatar Mar 20 '25 03:03 yawnzh


when using flashinfer with CUDA 12.4, it only needs to compile a customized kernel for dumping attention logits, but when using flashinfer with CUDA 12.1, it also needs to compile the batch decode op:

cuda 12.1 log

2025-03-20 08:21:46,410 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:22:11,437 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:22:11,449 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-03-20 08:22:36,508 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False

cuda 12.4 log

2025-03-20 08:46:48,870 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:47:10,692 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_dump_logits_False

I can't use CUDA 12.4 because I also need vLLM, and it doesn't have a precompiled version for CUDA 12.4.

yawnzh avatar Mar 20 '25 09:03 yawnzh

Well, looks like I can install vLLM cuda 12.1 and flashinfer cuda 12.4 together, problem solved!

yawnzh avatar Mar 21 '25 02:03 yawnzh

Hi @yawnzh can you please paste the installation command here :) I am facing the same issue on Kaggle.

ghost avatar Mar 26 '25 02:03 ghost

Hi @yawnzh can you please paste the installation command here :) I am facing the same issue on Kaggle.

pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.4/

yawnzh avatar Mar 26 '25 08:03 yawnzh

Thanks, this worked for me as well!

wangray avatar Mar 29 '25 16:03 wangray

@yawnzh I also ran pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.4/ but vllm still outputs "INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend"

qiulang avatar Apr 13 '25 09:04 qiulang

@yawnzh I also run pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.4/ but vllm still output "INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend"

@yawnzh getting the same despite attempting several workarounds - is this expected behaviour? please confirm 🙏

zazer0 avatar May 27 '25 12:05 zazer0

After upgrading to vLLM 0.9 & flashinfer 0.2.5, that message is gone. Now I see these, so I guess the problem was fixed:

2025-05-28 17:01:20,390 - INFO - flashinfer.jit: Loading JIT ops: sampling
/home/vllm/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/home/vllm/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-05-28 17:01:57,699 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
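If the TORCH_CUDA_ARCH_LIST warning above is a concern, one option (an assumption about the deployment, not something stated in this thread) is to pin the arch list to your GPU before torch is imported, so JIT compilation targets only the archs you need. "9.0" here is just an example value:

```python
# Sketch: set TORCH_CUDA_ARCH_LIST before importing torch/flashinfer so JIT
# compilation targets only the archs you need. "9.0" (Hopper) is an example;
# substitute your GPU's compute capability.
import os

os.environ.setdefault("TORCH_CUDA_ARCH_LIST", "9.0")
print(os.environ["TORCH_CUDA_ARCH_LIST"])
# import torch / flashinfer only after the variable is set
```

Setting the variable in the service's launch environment (e.g. the container spec) works equally well and avoids ordering issues with imports.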

qiulang avatar May 28 '25 09:05 qiulang

For JIT version, If you properly warmup the kernel compilation, this would not bring performance decrease. We have cache for JIT compiled kernels.

I am mainly worried about a performance penalty from this warning. @yzh119 Could you please expand a little on what you mean by warming up the kernel compilation?

ahakanbaba avatar Sep 26 '25 05:09 ahakanbaba