[Feature] support torch compile cache for DeepSeek V3/R1

Open zhyncs opened this issue 10 months ago • 1 comment

Checklist

  • [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [ ] 2. Please use English, otherwise it will be closed.

Motivation

as titled

The time taken for each startup is currently too long when torch compile is enabled. It needs optimization.

Related resources

No response

zhyncs avatar Feb 16 '25 16:02 zhyncs

I will work on this.

FrankLeeeee avatar Feb 17 '25 10:02 FrankLeeeee

If this function is not implemented yet, how does the TORCHINDUCTOR_CACHE_DIR option behave? Will the files in /tmp/torchinductor_root/ simply be ignored when the server starts?

junliu-mde avatar Feb 20 '25 12:02 junliu-mde

If this function is not implemented yet, how does the TORCHINDUCTOR_CACHE_DIR option behave? Will the files in /tmp/torchinductor_root/ simply be ignored when the server starts?

According to the official PyTorch doc at https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html, torch.compile enables caching by default; it just saves the cache in /tmp/torchinductor_root, which may be cleared at some point. Explicitly setting TORCHINDUCTOR_CACHE_DIR will save your cache in a specified directory, which you can copy to other machines for reuse.
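
For illustration, here is a minimal sketch of persisting the inductor cache across runs; the cache path is just an example, not an sglang default:

import os

# Must be set before torch compiles anything; the path is an example.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/data/inductor_cache"

import torch

@torch.compile
def double_plus_one(x):
    return x * 2 + 1

# The first run writes compiled artifacts under /data/inductor_cache;
# later runs on the same hardware and torch version can reuse them.
print(double_plus_one(torch.randn(4)))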

FrankLeeeee avatar Feb 21 '25 02:02 FrankLeeeee

If this function is not implemented yet, how does the TORCHINDUCTOR_CACHE_DIR option behave? Will the files in /tmp/torchinductor_root/ simply be ignored when the server starts?

According to the official PyTorch doc at https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html, torch.compile enables caching by default; it just saves the cache in /tmp/torchinductor_root, which may be cleared at some point. Explicitly setting TORCHINDUCTOR_CACHE_DIR will save your cache in a specified directory, which you can copy to other machines for reuse.

In that case, all that needs to be supported is saving/reading the cache to/from an arbitrary path, correct? If I copy /tmp/torchinductor_root from one machine to another, will it still work on both machines?

junliu-mde avatar Feb 21 '25 02:02 junliu-mde

If this function is not implemented yet, how does the TORCHINDUCTOR_CACHE_DIR option behave? Will the files in /tmp/torchinductor_root/ simply be ignored when the server starts?

According to the official PyTorch doc at https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html, torch.compile enables caching by default; it just saves the cache in /tmp/torchinductor_root, which may be cleared at some point. Explicitly setting TORCHINDUCTOR_CACHE_DIR will save your cache in a specified directory, which you can copy to other machines for reuse.

In that case, all that needs to be supported is saving/reading the cache to/from an arbitrary path, correct? If I copy /tmp/torchinductor_root from one machine to another, will it still work on both machines?

Yes, I will add some content on caching to the DeepSeek doc. The cache will still work if both machines have the same hardware.
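
A hedged sketch of shipping the cache to a second machine with identical hardware (paths are illustrative):

import shutil

# Pack the inductor cache on the source machine; both machines must
# have the same GPU model, driver, and torch version for reuse to work.
shutil.make_archive("/tmp/inductor_cache", "gztar", root_dir="/tmp/torchinductor_root")

# On the target machine, unpack it and point torch at it, e.g.:
#   shutil.unpack_archive("/tmp/inductor_cache.tar.gz", "/tmp/torchinductor_root")
#   os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/torchinductor_root"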

FrankLeeeee avatar Feb 21 '25 02:02 FrankLeeeee

It still takes too long (~180s on deepseek-r1, with batch sizes [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 65]) during CUDA graph capture when torch compile is enabled. It seems torch only reuses part of TORCHINDUCTOR_CACHE_DIR and recompiles the rest every time sglang starts.

The changes below could reduce the time, but I have no idea why dynamic=False was set in this commit: https://github.com/sgl-project/sglang/commit/07ec07ad1fa59e0f07a4fcd1b1f324123c2e2bd4

--- a/python/sglang/srt/model_executor/cuda_graph_runner.py
+++ b/python/sglang/srt/model_executor/cuda_graph_runner.py
@@ -105,7 +105,7 @@ def patch_model(
                 mode=os.environ.get(
                     "SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"
                 ),
-                dynamic=False,
+                dynamic=True,
             )
         else:
             yield model.forward
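
For context, a standalone sketch (not sglang's actual code path) of how the dynamic flag affects recompilation across batch sizes:

import torch

def forward(x):
    # Stand-in for a model forward pass.
    return torch.nn.functional.relu(x @ x.T)

# With dynamic=False, torch.compile specializes on each concrete shape,
# so every new batch size can trigger a fresh compile. With dynamic=True,
# shapes are traced symbolically and one compiled graph can be reused.
compiled = torch.compile(forward, dynamic=True)

for bs in (1, 2, 4, 8, 16):
    compiled(torch.randn(bs, 64))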

tianyuzhou95 avatar Jun 20 '25 07:06 tianyuzhou95