flashinfer
Prebuilt kernels not found, using JIT backend
Under what circumstances does this message appear? Since my environment is offline, I downloaded the .whl file from the release and installed it successfully on the server using pip, with no errors during the entire process. Should I be concerned about this message, and does it indicate a potential decrease in inference performance?
Hi @rangehow, that message indicates you are using the JIT version instead of the AOT version.
I downloaded the .whl file from the release and installed it successfully on the server using pip
What's the URL of the wheel file you downloaded?
For the JIT version, if you properly warm up the kernel compilation, there is no performance decrease: JIT-compiled kernels are cached.
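A minimal warmup sketch (assuming the single_decode_with_kv_cache API and arbitrary shapes, not a prescribed recipe): run something like this once at startup so the compiled kernel lands in the JIT cache before serving traffic.
import torch
import flashinfer

# Hypothetical shapes; use the head counts and head_dim of your own model.
num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 64
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
# The first call triggers JIT compilation (slow); later calls reuse the cached kernel.
flashinfer.single_decode_with_kv_cache(q, k, v)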
Thanks for your explanation. I downloaded the wheel from the releases page of this repository: https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.1.post2/flashinfer_python-0.2.1.post2+cu121torch2.5-cp38-abi3-linux_x86_64.whl
Is there some way to download a .whl for the AOT version?
Hi, those wheels are already the AOT version; the JIT version includes only source code, not binary files (the PyPI package is only 1.7 MB).
I suspect you might have multiple versions of flashinfer installed; you can try
import flashinfer
print(flashinfer.__version__)
and check whether it matches the wheel you installed.
2025-02-25 08:37:29,910 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-25 08:37:30 __init__.py:207] Automatically detected platform cuda.
vLLM version: 0.7.3
FlashInfer version: 0.2.1.post2+cu121torch2.5
Hi, I'm also seeing the same issue with a fresh install (Python 3.10, CUDA 12.2, using the cu121 wheel). Is this expected?
~/ > mkdir import-flashinfer
~/ > cd import-flashinfer
~/import-flashinfer/ > python3 -m venv .venv
~/import-flashinfer/ > source .venv/bin/activate
(.venv) ~/import-flashinfer/ > pip install flashinfer-python --find-links https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python
Looking in links: https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python
Collecting flashinfer-python
Downloading https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2.post1/flashinfer_python-0.2.2.post1%2Bcu121torch2.5-cp38-abi3-linux_x86_64.whl (527.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.1/527.1 MB 7.4 MB/s eta 0:00:00
Collecting torch==2.5.*
Using cached torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl (906.4 MB)
Collecting nvidia-cublas-cu12==12.4.5.8
Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
Collecting jinja2
Using cached jinja2-3.1.6-py3-none-any.whl (134 kB)
Collecting filelock
Using cached filelock-3.17.0-py3-none-any.whl (16 kB)
Collecting sympy==1.13.1
Using cached sympy-1.13.1-py3-none-any.whl (6.2 MB)
Collecting nvidia-curand-cu12==10.3.5.147
Using cached nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)
Collecting nvidia-cuda-cupti-cu12==12.4.127
Using cached nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)
Collecting nvidia-nvjitlink-cu12==12.4.127
Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
Collecting fsspec
Using cached fsspec-2025.2.0-py3-none-any.whl (184 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70
Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
Collecting triton==3.1.0
Using cached triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)
Collecting nvidia-nccl-cu12==2.21.5
Using cached nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)
Collecting nvidia-cusolver-cu12==11.6.1.9
Using cached nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)
Collecting nvidia-cusparse-cu12==12.3.1.170
Using cached nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127
Using cached nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)
Collecting nvidia-cuda-runtime-cu12==12.4.127
Using cached nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)
Collecting nvidia-nvtx-cu12==12.4.127
Using cached nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)
Collecting networkx
Using cached networkx-3.4.2-py3-none-any.whl (1.7 MB)
Collecting nvidia-cufft-cu12==11.2.1.3
Using cached nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)
Collecting typing-extensions>=4.8.0
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting mpmath<1.4,>=1.1.0
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Collecting MarkupSafe>=2.0
Using cached MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20 kB)
Installing collected packages: mpmath, typing-extensions, sympy, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, flashinfer-python
Successfully installed MarkupSafe-3.0.2 filelock-3.17.0 flashinfer-python-0.2.2.post1+cu121torch2.5 fsspec-2025.2.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 sympy-1.13.1 torch-2.5.1 triton-3.1.0 typing-extensions-4.12.2
[notice] A new release of pip is available: 23.0.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
(.venv) ~/import-flashinfer/ > python
Python 3.10.14 (main, Jan 18 2025, 03:01:18) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import flashinfer
/home/user/import-flashinfer/.venv/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
2025-03-05 20:19:35,478 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
>>> flashinfer.__version__
'0.2.2.post1+cu121torch2.5'
Hi @yzh119, it seems this problem needs more attention, since a lot of people are running into it.
The reply you provided suggests a way to check, but the check never shows any issue; and even if it did, there is no viable fix. The current problem seems to be that a version installed from the AOT wheel still ends up running as the JIT version.
same problem
import flashinfer
2025-03-14 15:07:23,015 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
print(flashinfer.__version__)
0.2.3+cu121torch2.5
Same problem here. My container takes a lot of time compiling the kernels at startup.
>>> import torch
>>> torch.__version__
'2.4.0+cu121'
>>> import flashinfer
2025-03-20 03:09:00,820 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
>>> flashinfer.__version__
'0.2.3+cu121torch2.4'
The root cause is probably this: https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/jit/__init__.py#L67
>>> import flashinfer.flashinfer_kernels_sm90
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer.flashinfer_kernels_sm90'
flashinfer_kernels_sm90.abi3.so is missing in https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.3/flashinfer_python-0.2.3+cu121torch2.4-cp38-abi3-linux_x86_64.whl
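A quick way to check whether the prebuilt (AOT) kernel modules actually shipped with your install (flashinfer.flashinfer_kernels_sm90 is the module named above; flashinfer.flashinfer_kernels is assumed to be its non-sm90 sibling):
import importlib.util

# Print which prebuilt kernel extension modules are importable in this environment.
for mod in ("flashinfer.flashinfer_kernels", "flashinfer.flashinfer_kernels_sm90"):
    spec = importlib.util.find_spec(mod)
    print(mod, "found" if spec is not None else "MISSING")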
When using flashinfer with CUDA 12.4, it only needs to compile a customized kernel for dumping attention logits, but with CUDA 12.1 it also needs to compile the batch decode op:
CUDA 12.1 log:
2025-03-20 08:21:46,410 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:22:11,437 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:22:11,449 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-03-20 08:22:36,508 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
CUDA 12.4 log:
2025-03-20 08:46:48,870 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_dump_logits_False
2025-03-20 08:47:10,692 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_dump_logits_False
I can't use CUDA 12.4 because I also need vLLM, and it doesn't have a precompiled version for CUDA 12.4.
Well, it looks like I can install the CUDA 12.1 build of vLLM and the CUDA 12.4 build of flashinfer together. Problem solved!
Hi @yawnzh, can you please paste the installation command here? :) I am facing the same issue on Kaggle.
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.4/
Thanks, this worked for me as well!
@yawnzh I also ran pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.4/ but vLLM still outputs "INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend".
@yawnzh I'm getting the same despite attempting several workarounds. Is this expected behaviour? Please confirm 🙏
After upgrading to vLLM 0.9 and flashinfer 0.2.5, that message is gone. Now I see the logs below, so I guess the problem was fixed:
2025-05-28 17:01:20,390 - INFO - flashinfer.jit: Loading JIT ops: sampling
/home/vllm/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/home/vllm/miniconda3/envs/vllm_env/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-05-28 17:01:57,699 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
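If you want to avoid compiling for every visible architecture (as the warning above suggests), a small sketch, assuming an H100; set the variable before anything is JIT-compiled and use your own GPU's compute capability:
import os

# "9.0" is an example for H100; use e.g. "8.0" for A100, or whatever matches your GPU.
# Must be set before torch/flashinfer compile any extensions.
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"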
For the JIT version, if you properly warm up the kernel compilation, there is no performance decrease: JIT-compiled kernels are cached.
I am mainly worried about a performance penalty from this warning. @yzh119, could you please expand a little on what you mean by warming up the kernel compilation?