
fix(jit): add filelock timeout report

Open • zobinHuang opened this issue 9 months ago • 1 comment

zobinHuang • Apr 01 '25 08:04

I pushed a commit with a case that reproduces the issue, which you can try: https://github.com/flashinfer-ai/flashinfer/pull/993/commits/58e83cfdae61aeade02c37d47460af6cad8f3220

yzh119 • Apr 02 '25 21:04

Per discussion with @abcdabcd987, we think the deadlock might be caused by NFS (if the user's cache directory is located on NFS), whose locking behavior is not fully POSIX-compliant and can affect filelock.

We will create another PR to fix the deadlock and use tmpfs for the filelock instead of the ~/.cache directory.

yzh119 • May 16 '25 19:05
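For reference, a minimal sketch of the direction described above, using the `filelock` package; the lock directory, file name, and timeout value here are illustrative, not the actual flashinfer change:

```python
import getpass
import os

from filelock import FileLock, Timeout

# Illustrative: keep the lock file on node-local tmpfs (e.g. /dev/shm) rather
# than an NFS-backed ~/.cache, since NFS locking is not fully POSIX-compliant
# and can make filelock block forever.
lock_dir = os.environ.get("TMPDIR", "/dev/shm")
lock_path = os.path.join(lock_dir, f"flashinfer-jit-{getpass.getuser()}.lock")

try:
    # A finite timeout turns a silent hang into an actionable error report.
    with FileLock(lock_path, timeout=300):
        pass  # the JIT compile and module load step would run here
except Timeout:
    raise RuntimeError(
        f"Could not acquire JIT lock {lock_path} within 300s; another process "
        "may hold it, or the lock file may be stale."
    )
```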

@yzh119 Hi, is this issue related to my stack trace? My flashinfer version is 0.2.5 and the TensorRT-LLM version is 0.20.0.

The stack trace where it is stuck:

Thread 3463494 (idle): "MainThread"
    acquire (/usr/local/lib/python3.10/dist-packages/filelock/_api.py:344)
    __enter__ (/usr/local/lib/python3.10/dist-packages/filelock/_api.py:376)
    load_cuda_ops (/usr/local/lib/python3.10/dist-packages/flashinfer/jit/core.py:134)
    get_norm_module (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:36)
    get_module_attr (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:50)
    _rmsnorm (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:98)
    rmsnorm (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:86)
    flashinfer_rmsnorm (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py:47)
    wrapped_fn (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:367)
    _fn (/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:838)
    inner (/usr/local/lib/python3.10/dist-packages/torch/_compile.py:51)
    backend_impl (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:335)
    __call__ (/usr/local/lib/python3.10/dist-packages/torch/_ops.py:756)
    __call__ (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:671)
    forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/modules/rms_norm.py:43)
    _call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
    _wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
    forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_llama.py:522)
    _call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
    _wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
    forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_llama.py:790)
    _call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
    _wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
    forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py:517)
    model_forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:2000)
    _forward_step (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:2012)
    forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:1962)
    wrapper (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/utils.py:66)
    decorate_context (/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:116)
    warmup (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:679)
    __init__ (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py:245)
    create_py_executor_instance (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py:446)
    create_py_executor (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py:190)
    _create_engine (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:126)
    __init__ (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:128)
    worker_main (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:698)
    wrapper (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/utils.py:35)
    call (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:844)
    server_exec (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:865)
    server_main_comm (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1215)
    server_main_spawn (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1222)
    server_main (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1254)
    main (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/server.py:11)
    <module> (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/server.py:15)
    _run_code (/usr/lib/python3.10/runpy.py:86)
    _run_module_as_main (/usr/lib/python3.10/runpy.py:196)
Thread 3463624 (idle): "Thread-1 (_read_thread)"
    _recv_msg (/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py:55)
    _read_thread (/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py:191)
    run (/usr/lib/python3.10/threading.py:953)
    _bootstrap_inner (/usr/lib/python3.10/threading.py:1016)
    _bootstrap (/usr/lib/python3.10/threading.py:973)
Thread 3463781 (idle): "Thread-2"
    wait (/usr/lib/python3.10/threading.py:324)
    wait (/usr/lib/python3.10/threading.py:607)
    run (/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py:60)
    _bootstrap_inner (/usr/lib/python3.10/threading.py:1016)
    _bootstrap (/usr/lib/python3.10/threading.py:973)

foreverlms • Jul 23 '25 08:07
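For context, the main thread in the dump above is blocked in `FileLock.acquire` (filelock/_api.py:344) with no timeout, so it waits indefinitely if the lock is never released. A minimal, self-contained illustration of the behavior the issue title asks for, using the `filelock` package and an illustrative lock path rather than flashinfer's actual cache layout:

```python
from filelock import FileLock, Timeout

# Illustrative lock path, not flashinfer's real cache layout.
lock = FileLock("/tmp/example-jit.lock")

# Without a timeout, acquire() blocks forever when the lock is never released
# (e.g. a stale lock file on NFS), matching the idle MainThread above:
#   lock.acquire()

# With a timeout, the same situation is reported instead of silently hanging,
# which is what "add filelock timeout report" requests.
try:
    with lock.acquire(timeout=10):
        pass  # critical section (JIT compile and load) would run here
except Timeout:
    print(f"Timed out waiting for lock file: {lock.lock_file}")
```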

Hi @foreverlms, the deadlock issue has been resolved by https://github.com/flashinfer-ai/flashinfer/issues/1064, led by @abcdabcd987, but the fix is not available in v0.2.5. Would you mind upgrading to a later version of flashinfer?

yzh119 • Jul 23 '25 08:07

> Hi @foreverlms, the deadlock issue has been resolved by #1064, led by @abcdabcd987, but the fix is not available in v0.2.5. Would you mind upgrading to a later version of flashinfer?

Hi Zihao, would you mind explaining why there is a deadlock? It seems related to the JIT. I am using TensorRT-LLM, and for the past week the demo program has worked fine with flashinfer as the RMSNorm backend. But starting last night, the demo script suddenly hangs. I spent a few hours figuring out why and finally realized it is related to flashinfer. Is this something like a resource limitation in the JIT caching? TensorRT-LLM v0.20.0 requires flashinfer 0.2.5.

In any case, I will upgrade flashinfer and give it a try.

foreverlms • Jul 23 '25 08:07

So what is the root cause of the deadlock? Can you elaborate on it? @yzh119

Jin-Chuan • Aug 28 '25 08:08