fix(jit): add filelock timeout report
I pushed a commit with a case that reproduces the issue, which you can try: https://github.com/flashinfer-ai/flashinfer/pull/993/commits/58e83cfdae61aeade02c37d47460af6cad8f3220
Per discussion with @abcdabcd987, we think the deadlock might be coming from NFS (if the user's cache directory is located on NFS), which is not POSIX-compliant and can affect filelock behavior.
We will create another PR to fix the deadlock issue and use tmpfs for the filelock instead of the ~/.cache directory.
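As a rough sketch of what a filelock timeout report could look like (not the actual implementation in this PR; the lock location, timeout value, and error message are assumptions), the `filelock` package already supports a timeout that can be turned into a clear error instead of an indefinite hang, and the lock file can be placed on tmpfs:

```python
import getpass
import os
import tempfile

from filelock import FileLock, Timeout

# Assumption: put the lock on tmpfs-backed /dev/shm (falling back to /tmp)
# instead of an NFS-mounted ~/.cache directory.
lock_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
lock_path = os.path.join(lock_dir, f"flashinfer-jit-{getpass.getuser()}.lock")

lock = FileLock(lock_path, timeout=60)  # illustrative 60 s timeout
try:
    with lock:
        pass  # JIT compilation / module loading would happen here
except Timeout:
    # Instead of hanging forever, report which lock file could not be acquired
    # so the user can tell whether it lives on a problematic filesystem.
    raise RuntimeError(
        f"Timed out waiting for JIT filelock {lock_path!r}; "
        "check whether the cache directory is on NFS."
    )
```

Placing the lock on /dev/shm sidesteps NFS because tmpfs locks are purely local, though it only serializes processes on the same machine.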
@yzh119 Hi, is this issue related to my stack trace? flashinfer version is 0.2.5, TensorRT-LLM version is 0.20.0.
The stack where it gets stuck:
Thread 3463494 (idle): "MainThread"
acquire (/usr/local/lib/python3.10/dist-packages/filelock/_api.py:344)
__enter__ (/usr/local/lib/python3.10/dist-packages/filelock/_api.py:376)
load_cuda_ops (/usr/local/lib/python3.10/dist-packages/flashinfer/jit/core.py:134)
get_norm_module (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:36)
get_module_attr (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:50)
_rmsnorm (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:98)
rmsnorm (/usr/local/lib/python3.10/dist-packages/flashinfer/norm.py:86)
flashinfer_rmsnorm (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py:47)
wrapped_fn (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:367)
_fn (/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:838)
inner (/usr/local/lib/python3.10/dist-packages/torch/_compile.py:51)
backend_impl (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:335)
__call__ (/usr/local/lib/python3.10/dist-packages/torch/_ops.py:756)
__call__ (/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py:671)
forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/modules/rms_norm.py:43)
_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
_wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_llama.py:522)
_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
_wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_llama.py:790)
_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762)
_wrapped_call_impl (/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751)
forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py:517)
model_forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:2000)
_forward_step (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:2012)
forward (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:1962)
wrapper (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/utils.py:66)
decorate_context (/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:116)
warmup (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py:679)
__init__ (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py:245)
create_py_executor_instance (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py:446)
create_py_executor (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py:190)
_create_engine (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:126)
__init__ (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:128)
worker_main (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor/worker.py:698)
wrapper (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/utils.py:35)
call (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:844)
server_exec (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:865)
server_main_comm (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1215)
server_main_spawn (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1222)
server_main (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/_core.py:1254)
main (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/server.py:11)
<module> (/usr/local/lib/python3.10/dist-packages/mpi4py/futures/server.py:15)
_run_code (/usr/lib/python3.10/runpy.py:86)
_run_module_as_main (/usr/lib/python3.10/runpy.py:196)
Thread 3463624 (idle): "Thread-1 (_read_thread)"
_recv_msg (/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py:55)
_read_thread (/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py:191)
run (/usr/lib/python3.10/threading.py:953)
_bootstrap_inner (/usr/lib/python3.10/threading.py:1016)
_bootstrap (/usr/lib/python3.10/threading.py:973)
Thread 3463781 (idle): "Thread-2"
wait (/usr/lib/python3.10/threading.py:324)
wait (/usr/lib/python3.10/threading.py:607)
run (/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py:60)
_bootstrap_inner (/usr/lib/python3.10/threading.py:1016)
_bootstrap (/usr/lib/python3.10/threading.py:973)
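If you want to check the NFS theory mentioned above on your own machine, one quick diagnostic (generic Linux code, not part of flashinfer; the ~/.cache path is an assumption and may differ on your setup) is to look up which filesystem the cache directory is mounted on:

```python
import os

def fs_type(path: str) -> str:
    """Return the filesystem type of the mount containing `path` (Linux only)."""
    path = os.path.realpath(os.path.expanduser(path))
    best_mnt, best_type = "", "unknown"
    with open("/proc/mounts") as f:
        for line in f:
            _dev, mnt, fstype = line.split()[:3]
            # Pick the longest mount point that is a prefix of the path.
            if (mnt == "/" or path == mnt or path.startswith(mnt + "/")) and len(mnt) > len(best_mnt):
                best_mnt, best_type = mnt, fstype
    return best_type

# "nfs"/"nfs4" would support the NFS theory; "ext4"/"xfs"/"tmpfs" would not.
print(fs_type("~/.cache"))  # adjust to the actual flashinfer cache directory
```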
Hi @foreverlms, the deadlock issue has been resolved since https://github.com/flashinfer-ai/flashinfer/issues/1064, led by @abcdabcd987, but the fix is not available in v0.2.5. Would you mind upgrading to a later version of flashinfer?
Hi Zihao, would you mind explaining why there is a deadlock? It seems related to JIT. I am using TensorRT-LLM, and for the past week the demo program worked fine with flashinfer as the RMSNorm backend. But starting last night, the demo script hangs. I spent a few hours figuring out why and finally realized it's related to flashinfer. So is this something like a resource limitation of JIT caching? TensorRT-LLM v0.20.0 requires flashinfer 0.2.5.
In any case, I will upgrade flashinfer and give it a try.
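For reference, the stack trace above is blocked in filelock's acquire() called from flashinfer's load_cuda_ops, i.e. a cross-process lock taken around JIT compilation. A minimal sketch of that pattern (illustrative only; the function names and cache path are placeholders, not flashinfer's actual code):

```python
from pathlib import Path
from filelock import FileLock

CACHE_DIR = Path.home() / ".cache" / "flashinfer"  # assumed cache location

def compile_or_load_extension(op_name: str):
    # Placeholder for the real build-and-load step.
    return f"<compiled module for {op_name}>"

def load_cuda_ops_sketch(op_name: str):
    """Illustrative stand-in for the JIT loader seen in the stack trace."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    lock = FileLock(str(CACHE_DIR / f"{op_name}.lock"))
    # Every process (e.g. each MPI rank spawned by TensorRT-LLM) takes the same
    # lock before compiling or loading the cached extension.  If the lock file
    # sits on a filesystem with unreliable lock semantics (NFS), acquire() can
    # block forever even when no other process is actually compiling.
    with lock:
        return compile_or_load_extension(op_name)
```

With no timeout configured, a stale or unreachable lock file would simply hang every rank at startup, which would match the idle MainThread in the dump above.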
So what is the root cause of the deadlock? Can you elaborate on it? @yzh119