mypy is slow when type checking torch
λ mypy --version
mypy 1.11.2 (compiled: yes)
λ uv pip show torch
Using Python 3.11.8 environment at /Users/shantanu/.virtualenvs/openai-wfht
Name: torch
Version: 2.1.0
Location: /Users/shantanu/.virtualenvs/openai-wfht/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: ...
λ time mypy -c 'import torch' --no-incremental
Success: no issues found in 1 source file
mypy -c 'import torch' --no-incremental 33.09s user 2.73s system 98% cpu 36.391 total
λ time mypy -c 'import torch'
Success: no issues found in 1 source file
mypy -c 'import torch' 6.24s user 0.88s system 95% cpu 7.454 total
We use a lot of torch at work; performance is probably the biggest reason folks there switch to a different type checker.
If this is accurate, maybe the fscache exception handling is really slowing us down in the mypyc build.
[a timing table comparing the mypyc-compiled (native) and interpreted builds appeared here; only the row labels survived extraction]
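To make the hypothesis concrete, here is a minimal micro-benchmark sketch of my own (not mypy's actual fscache code): probing for files that usually don't exist via try/except means raising and catching an exception on every miss, and the guess above is that this raise/catch path is relatively more expensive in mypyc-compiled code than in the interpreter. The path below is arbitrary.

# Hypothetical micro-benchmark, not mypy internals: compare probing for a
# missing file via an exception (EAFP) versus a check that handles the miss
# without a Python-level exception.
import os
import timeit

MISSING = "/tmp/definitely-not-here-12345"

def probe_eafp() -> bool:
    # fscache-style: attempt the stat and catch the failure
    try:
        os.stat(MISSING)
        return True
    except OSError:
        return False

def probe_lbyl() -> bool:
    # os.path.exists swallows the error at the C level,
    # so no Python exception is raised on the common miss path
    return os.path.exists(MISSING)

for fn in (probe_eafp, probe_lbyl):
    t = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {t:.3f}s for 100k probes")

Running this under both the interpreted and the mypyc-compiled build would show whether the exception path really carries a bigger penalty when compiled.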
mypy -v produces details about processed files, and this seems important:
LOG: Processing SCC of size 945 (torch.onnx._globals torch._inductor.exc torch._inductor.runtime.hints torch.utils._traceback torch.utils._sympy.functions ... <long output snipped>
Mypy detects an import cycle with 945 modules.
Overall 1380 files were parsed, so 68% of processed files are in this one SCC. I've seen this pattern in other third-party packages as well -- the majority of the implementation is a single SCC.
A potential way to make the SCC smaller would be to process imports lazily in third-party modules (where this is possible, since errors aren't reported there). It may be tricky to implement, though; I'll think about it more.
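To illustrate why the SCC matters, here is a toy sketch of my own (not mypy internals): modules that import each other, even indirectly, collapse into one strongly connected component and must be processed together. The module names and graph are made up; the SCCs are computed with Tarjan's algorithm.

# Toy import graph: a imports b, b imports c, c imports a (a cycle),
# while d only imports a. Edges point from a module to its imports.
from typing import Dict, List, Set

def tarjan_sccs(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return the strongly connected components of a directed graph."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    on_stack: Set[str] = set()
    stack: List[str] = []
    sccs: List[List[str]] = []
    counter = 0

    def visit(v: str) -> None:
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop everything above it
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(tarjan_sccs(graph))  # [['c', 'b', 'a'], ['d']]

Because a, b, and c form a cycle, none of them can be fully analyzed before the others, so they end up in one SCC; in torch's case that component is 945 modules wide, which defeats module-at-a-time processing and caching granularity.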
Yeah, lazy import resolution could be a massive perf win.
https://github.com/python/mypy/issues/17924 is the issue for tracking lazy resolution
Jukka's times in https://github.com/python/mypy/pull/17920#issuecomment-2406966926 are much better than mine. https://github.com/python/mypy/issues/17948 is the issue for tracking performance improvements in my work environment.
Performance is now a lot better, but I bet there are still some good opportunities to make this faster. Fresh CPU profiles would be interesting to see.
Here's a new profile for 53134979c!
Install torch, along with a few extra dependencies:
rm -rf torchenv
python -m venv torchenv
uv pip install --python torchenv/bin/python torch matplotlib onnx optree types-redis --exclude-newer 2024-10-29
Then I get the following on Python 3.11:
λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_53134979c/venv/bin/mypy -c "import torch" --python-executable=torchenv/bin/python --no-incremental'
Benchmark 1: /tmp/mypy_primer/timer_mypy_53134979c/venv/bin/mypy -c "import torch" --python-executable=torchenv/bin/python --no-incremental
Time (mean ± σ): 27.210 s ± 0.194 s [User: 25.506 s, System: 1.684 s]
Range (min … max): 27.052 s … 27.426 s    3 runs
Here's the output of:
py-spy record --native -- /tmp/mypy_primer/timer_mypy_53134979c/venv/bin/python -m mypy -c "import torch" --no-incremental --python-executable torchenv/bin/python
(I realised py-spy also supports --format speedscope, which can be nicer, but is harder to just link on GitHub)
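For reference, the speedscope variant would be the same command with the format and an output path added (the output filename here is arbitrary); the resulting JSON can be opened at https://www.speedscope.app:

py-spy record --native --format speedscope -o torch-mypy.speedscope.json -- /tmp/mypy_primer/timer_mypy_53134979c/venv/bin/python -m mypy -c "import torch" --no-incremental --python-executable torchenv/bin/python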
@hauntsaninja I've merged some additional optimizations. It would be interesting to see if the numbers have improved.
They have indeed improved!
With this env on Python 3.11:
rm -rf torchenv
python -m venv torchenv
uv pip install --python torchenv/bin/python torch matplotlib onnx optree types-redis --exclude-newer 2024-10-29
Running the following:
export PYTHON="/tmp/mypy_primer/timer_mypy_$COMMIT/venv/bin/python"
$PYTHON -m pip install orjson
$PYTHON -m mypy --version
hyperfine -w 1 -M 5 "$PYTHON -m mypy -c 'import torch' --python-executable torchenv/bin/python"
hyperfine -w 2 -M 5 "$PYTHON -m mypy -c 'import torch' --python-executable torchenv/bin/python --no-incremental"
I get:
Benchmark 1: /tmp/mypy_primer/timer_mypy_eb310343/venv/bin/python -m mypy -c 'import torch' --python-executable torchenv/bin/python
Time (mean ± σ): 3.151 s ± 0.163 s [User: 2.593 s, System: 0.556 s]
Range (min … max): 3.008 s … 3.409 s 5 runs
hyperfine -w 1 -M 5 38.61s user 4.73s system 100% cpu 43.331 total
Benchmark 1: /tmp/mypy_primer/timer_mypy_eb310343/venv/bin/python -m mypy -c 'import torch' --python-executable torchenv/bin/python --no-incremental
Time (mean ± σ): 27.366 s ± 0.579 s [User: 25.290 s, System: 2.052 s]
Range (min … max): 26.552 s … 28.128 s 5 runs
mypy 1.15.0+dev.d33cef8396c456d87db16dce3525ebf431f4b57f (compiled: yes)
Benchmark 1: /tmp/mypy_primer/timer_mypy_d33cef83/venv/bin/python -m mypy -c 'import torch' --python-executable torchenv/bin/python
Time (mean ± σ): 2.473 s ± 0.038 s [User: 1.966 s, System: 0.505 s]
Range (min … max): 2.443 s … 2.538 s 5 runs
hyperfine -w 1 -M 5 33.29s user 4.25s system 100% cpu 37.536 total
Benchmark 1: /tmp/mypy_primer/timer_mypy_d33cef83/venv/bin/python -m mypy -c 'import torch' --python-executable torchenv/bin/python --no-incremental
Time (mean ± σ): 25.583 s ± 0.375 s [User: 23.681 s, System: 1.884 s]
Range (min … max): 25.091 s … 26.134 s 5 runs
So latest master is 1.27x faster on incremental and 1.07x faster on non-incremental runs compared to mypy 1.13.
On the latest master, if you use --fixed-format-cache, which was recently added by @ilevkivskyi, warm runs (with the cache already generated) are significantly faster than before. #19681 also helped with cache deserialization speed. I hope to work on #17924 to further improve performance in incremental mode.
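For anyone who wants to reproduce the warm-run comparison, something like the following should work with the benchmarking setup above (the hyperfine invocation is my own sketch; the warmup iteration populates the fixed-format cache, so the measured runs are all warm):

hyperfine -w 1 -M 5 "$PYTHON -m mypy --fixed-format-cache -c 'import torch' --python-executable torchenv/bin/python"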