Improve performance in environments with long search paths
In my work environment, we install most Python packages in editable mode. This leads to long module search paths; 200 entries is common. I think it should be possible to significantly improve mypy's performance in this case.
My benchmark workload is `mypy -c "import torch"` on a mypyc-compiled mypy at compile level 3.
I'll run it in the following environments:
- `clean`:

  ```bash
  rm -rf clean
  python -m venv clean
  uv pip install torch --python clean/bin/python
  ```

- `long`:

  ```bash
  rm -rf long
  python -m venv long
  uv pip install torch --python long/bin/python
  for i in $(seq 1 200); do
      dir=$(pwd)/repo/$i
      mkdir -p $dir
      echo $dir >> $(long/bin/python -c "import site; print(site.getsitepackages()[0])")/repo.pth
  done
  ```

- `openai`: This is my main dev environment. I'll see if I can make an artificial environment that matches the performance characteristics of this more closely (this is pretty easy; I just need to install a bunch of third-party libraries).
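As a quick sanity check that the `.pth` trick works as intended (this snippet is only for verification, not part of the setup), all 200 `repo.pth` entries should show up on `sys.path` in the `long` environment:

```python
# Run with long/bin/python: count the repo.pth entries that landed on sys.path.
import sys

repo_entries = [p for p in sys.path if "/repo/" in p]
print(len(repo_entries))  # expect 200, matching the seq 1 200 loop above
```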
bd9200bda is my baseline commit
```
λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --python-executable=clean/bin/python --no-incremental'
Benchmark 1: /tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --python-executable=clean/bin/python --no-incremental
  Time (mean ± σ):     19.372 s ± 0.179 s    [User: 17.018 s, System: 2.285 s]
  Range (min … max):   19.223 s … 19.570 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --python-executable=long/bin/python --no-incremental'
Benchmark 1: /tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --python-executable=long/bin/python --no-incremental
  Time (mean ± σ):     34.571 s ± 0.085 s    [User: 31.770 s, System: 2.762 s]
  Range (min … max):   34.499 s … 34.664 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_bd9200bda/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python
  Time (mean ± σ):     51.342 s ± 0.472 s    [User: 46.853 s, System: 4.423 s]
  Range (min … max):   50.840 s … 51.776 s    3 runs
```
https://github.com/python/mypy/pull/17920 has already provided a big win here
88ae62b4a was the commit I measured
```
λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --python-executable=clean/bin/python --no-incremental'
Benchmark 1: /tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --python-executable=clean/bin/python --no-incremental
  Time (mean ± σ):     19.094 s ± 0.195 s    [User: 16.782 s, System: 2.243 s]
  Range (min … max):   18.935 s … 19.312 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --python-executable=long/bin/python --no-incremental'
Benchmark 1: /tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --python-executable=long/bin/python --no-incremental
  Time (mean ± σ):     24.838 s ± 0.237 s    [User: 22.038 s, System: 2.750 s]
  Range (min … max):   24.598 s … 25.073 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_88ae62b4a/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python
  Time (mean ± σ):     34.161 s ± 0.163 s    [User: 29.818 s, System: 4.289 s]
  Range (min … max):   34.013 s … 34.336 s    3 runs
```
You can see that mypy in my environment is still 1.8x slower than it could be (and 1.3x slower in the reproducible toy environment).
Some ideas for things to experiment with:
- We could make fscache cleverer, e.g. calling scandir on parent directories to get cheaper is_dir and is_file checks, especially when querying the existence of entries on search paths (see the sketch after this list)
- We could avoid some of the case-sensitivity handling if we know our file system is case sensitive
- We could vendor some of os.path into mypy, so that mypyc can compile these functions
  - Starting with https://github.com/python/mypy/pull/17949
  - https://github.com/python/mypy/pull/17962 (although even just not using pathlib would probably be great)
- Use the fast path in modulefinder in more places
  - https://github.com/python/mypy/pull/17950
- Misc
  - https://github.com/python/mypy/pull/17965
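As a rough illustration of the first idea, here's a minimal standalone sketch (the class and method names are hypothetical; mypy's actual fscache has a different interface): one scandir of a parent directory answers is_file/is_dir for all of its children, instead of one stat call per query.

```python
import os


class DirEntryCache:
    """Sketch: satisfy many is_file/is_dir queries for siblings with a
    single scandir of their parent directory."""

    def __init__(self) -> None:
        # parent dir -> (file names, dir names), or None if unreadable
        self._listings: dict[str, tuple[set[str], set[str]] | None] = {}

    def _listing(self, parent: str) -> tuple[set[str], set[str]] | None:
        if parent not in self._listings:
            files: set[str] = set()
            dirs: set[str] = set()
            try:
                with os.scandir(parent or ".") as it:
                    for entry in it:
                        # follow symlinks to match os.path.isdir/isfile semantics
                        if entry.is_dir(follow_symlinks=True):
                            dirs.add(entry.name)
                        else:
                            files.add(entry.name)
            except OSError:
                self._listings[parent] = None
            else:
                self._listings[parent] = (files, dirs)
        return self._listings[parent]

    def is_file(self, path: str) -> bool:
        parent, name = os.path.split(path)
        listing = self._listing(parent)
        return listing is not None and name in listing[0]

    def is_dir(self, path: str) -> bool:
        parent, name = os.path.split(path)
        listing = self._listing(parent)
        return listing is not None and name in listing[1]
```

For a search path entry queried for many sibling modules, the first existence check pays for one directory listing and every subsequent check is a set lookup.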
What about filtering the module search path based on the first component(s) of the target module name? We could create a dict that maps a module name prefix `<prefix>` to the search path filtered based on the existence of a `<prefix>` directory, `<prefix>.py` or `<prefix>.pyi` in each search path entry.
For example, if torch is only present in a single search path entry, the search path for the torch prefix would contain only that single entry. If we are resolving, say, torch.foo, we'd first look up the filtered search path based on the torch prefix. This would usually contain only a single item, so performance should be similar to the easy/clean case, even with hundreds of search path entries.
If many search path entries contain the same directory/namespace package (e.g. common/), we could also filter by a length-two prefix. So we'd have the module search path for common.a mapping to search path entries that contain common/a/, common/a.py or common/a.pyi. Creating this lookup table could be slightly expensive, so we'd probably want to build the second-level mapping only when there are more than N matching search path entries for some top-level package, and only for those packages.
To determine the effective search path for a module, we'd look up prefixes of length 2 and then 1 (e.g. pkg.a, then pkg, for module pkg.a.b) to find a filtered search path. Building the top-level lookup table should be pretty quick, so we can probably always use it. We'd use the second-level lookup table when it exists, and otherwise fall back to the first-level table.
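A rough sketch of what the first-level table could look like (the names here are hypothetical, and a real implementation would go through fscache rather than raw os.listdir):

```python
import os


def build_prefix_index(search_paths: list[str]) -> dict[str, list[str]]:
    """Map each top-level module/package name to the subset of search path
    entries that could provide it, preserving search path order."""
    index: dict[str, list[str]] = {}
    for path in search_paths:
        try:
            names = os.listdir(path)
        except OSError:
            continue
        prefixes = set()
        for name in names:
            if name.endswith((".py", ".pyi")):
                prefixes.add(name.rsplit(".", 1)[0])
            elif os.path.isdir(os.path.join(path, name)):
                prefixes.add(name)  # package or namespace package directory
        for prefix in prefixes:
            index.setdefault(prefix, []).append(path)
    return index


def effective_search_path(
    module: str, index: dict[str, list[str]], search_paths: list[str]
) -> list[str]:
    # For torch.foo, look up the "torch" prefix; fall back to the full
    # search path if the prefix is unknown.
    top_level = module.split(".", 1)[0]
    return index.get(top_level, search_paths)
```

In the long environment above, every torch.* lookup would then scan a single search path entry instead of ~200.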
Recording new baseline numbers here for eb816b05c (after a few of the PRs above have been merged):
```
λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable clean/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable clean/bin/python
  Time (mean ± σ):     18.240 s ± 0.046 s    [User: 16.671 s, System: 1.552 s]
  Range (min … max):   18.201 s … 18.291 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable long/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable long/bin/python
  Time (mean ± σ):     21.581 s ± 0.115 s    [User: 19.600 s, System: 1.965 s]
  Range (min … max):   21.496 s … 21.712 s    3 runs

λ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_eb816b05c/venv/bin/mypy -c "import torch" --no-incremental --python-executable /opt/oai/bin/python
  Time (mean ± σ):     28.439 s ± 0.270 s    [User: 25.591 s, System: 2.829 s]
  Range (min … max):   28.197 s … 28.731 s    3 runs
```
Compared to https://github.com/python/mypy/commit/bd9200bda5595fc71c01fe0dff9debbce3467a84 we are:
- 1.06x faster on clean
- 1.6x faster on long
- 1.8x faster on openai
- 1.6x faster on openai incremental (9.376 -> 5.847)
New numbers for c201a187b (with orjson installed):
```
hyperfine -w 1 -M 3 "/tmp/mypy_primer/timer_mypy_c201a187b/venv/bin/mypy -c 'import torch' --no-incremental --python-executable clean/bin/python"
Benchmark 1: /tmp/mypy_primer/timer_mypy_c201a187b/venv/bin/mypy -c 'import torch' --no-incremental --python-executable clean/bin/python
  Time (mean ± σ):     17.205 s ± 0.057 s    [User: 15.689 s, System: 1.500 s]
  Range (min … max):   17.153 s … 17.265 s    3 runs

hyperfine -w 1 -M 3 "/tmp/mypy_primer/timer_mypy_c201a187b/venv/bin/mypy -c 'import torch' --no-incremental --python-executable long/bin/python"
Benchmark 1: /tmp/mypy_primer/timer_mypy_c201a187b/venv/bin/mypy -c 'import torch' --no-incremental --python-executable long/bin/python
  Time (mean ± σ):     19.361 s ± 0.373 s    [User: 17.489 s, System: 1.857 s]
  Range (min … max):   19.102 s … 19.789 s    3 runs
```
The openai environment I was using previously got mutated, so I'm not posting raw numbers for it. In the following, I re-ran the bd9200b baseline in a similar environment to get fair openai comparisons. Compared to https://github.com/python/mypy/commit/bd9200bda5595fc71c01fe0dff9debbce3467a84 we are:
- 1.13x faster on clean
- 1.18x faster on clean incremental (1.06x faster without orjson)
- 1.79x faster on long
- 1.92x faster on similar openai
- 2.19x faster on similar openai incremental (2.05x faster without orjson)
@hauntsaninja Are you interested in looking into filtering the search path (see my comment above)? If not, I might have a look at it at some point.
Yup, I'm interested in looking into it. Worth noting that the difference between "clean" and "long" is down to 1.13x (from 1.8x), so I'm prioritising things that will help "clean" and "openai", rather than specifically "long". The difference between "openai" and "long" seems to just be many more entries in site-packages (but equal number in sys.path). This is still a little mysterious to me, maybe torch has some hidden dependencies or something.
Posting more numbers:
```
+ /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy --version
mypy 1.14.0+dev.3420ef1554c40b433a638e31cb2109e591e85008 (compiled: yes)

+ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --no-incremental --python-executable clean/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --no-incremental --python-executable clean/bin/python
  Time (mean ± σ):     19.671 s ± 0.155 s    [User: 18.219 s, System: 1.439 s]
  Range (min … max):   19.551 s … 19.845 s    3 runs

+ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --no-incremental --python-executable long/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --no-incremental --python-executable long/bin/python
  Time (mean ± σ):     21.881 s ± 0.089 s    [User: 20.061 s, System: 1.807 s]
  Range (min … max):   21.784 s … 21.957 s    3 runs

+ hyperfine -w 1 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --no-incremental --python-executable /opt/oai/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --no-incremental --python-executable /opt/oai/bin/python
  Time (mean ± σ):     28.509 s ± 0.212 s    [User: 26.081 s, System: 2.409 s]
  Range (min … max):   28.364 s … 28.752 s    3 runs

+ hyperfine -w 2 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --python-executable clean/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --python-executable clean/bin/python
  Time (mean ± σ):      2.085 s ± 0.005 s    [User: 1.718 s, System: 0.366 s]
  Range (min … max):    2.081 s … 2.091 s    3 runs

+ hyperfine -w 2 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --python-executable long/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --python-executable long/bin/python
  Time (mean ± σ):      3.104 s ± 0.006 s    [User: 2.288 s, System: 0.816 s]
  Range (min … max):    3.098 s … 3.110 s    3 runs

+ hyperfine -w 2 -M 3 '/tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c '\''import torch'\'' --python-executable /opt/oai/bin/python'
Benchmark 1: /tmp/mypy_primer/timer_mypy_3420ef155/venv/bin/python -m mypy -c 'import torch' --python-executable /opt/oai/bin/python
  Time (mean ± σ):      3.928 s ± 0.006 s    [User: 3.067 s, System: 0.861 s]
  Range (min … max):    3.922 s … 3.932 s    3 runs
```
For reference, here's the benchmark harness, run with COMMIT set to the commit under test (the set -x trace is why the lines above are prefixed with +):

```bash
set -x

# Per-commit cache dir so incremental runs don't cross-contaminate
export MYPY_CACHE_DIR=mypycache/$COMMIT
mkdir -p "$MYPY_CACHE_DIR"
mkdir -p benchjson

export PYTHON="/tmp/mypy_primer/timer_mypy_$COMMIT/venv/bin/python"
$PYTHON -m pip install orjson
$PYTHON -m mypy --version

# Cold cache (--no-incremental), one warmup run
hyperfine -w 1 -M 5 --export-json "benchjson/${COMMIT}_clean.json" "$PYTHON -m mypy -c 'import torch' --no-incremental --python-executable clean/bin/python"
hyperfine -w 1 -M 5 --export-json "benchjson/${COMMIT}_long.json" "$PYTHON -m mypy -c 'import torch' --no-incremental --python-executable long/bin/python"
hyperfine -w 1 -M 5 --export-json "benchjson/${COMMIT}_oai.json" "$PYTHON -m mypy -c 'import torch' --no-incremental --python-executable /opt/oai/bin/python"

# Warm cache (incremental), two warmup runs to populate the cache
hyperfine -w 2 -M 5 --export-json "benchjson/${COMMIT}_clean_inc.json" "$PYTHON -m mypy -c 'import torch' --python-executable clean/bin/python"
hyperfine -w 2 -M 5 --export-json "benchjson/${COMMIT}_long_inc.json" "$PYTHON -m mypy -c 'import torch' --python-executable long/bin/python"
hyperfine -w 2 -M 5 --export-json "benchjson/${COMMIT}_oai_inc.json" "$PYTHON -m mypy -c 'import torch' --python-executable /opt/oai/bin/python"
```
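To turn the exported JSON into speedup ratios like the ones quoted above, a small helper like this works (my own throwaway script; it relies only on the results[0].mean field that hyperfine's --export-json writes):

```python
import json
import sys


def mean_seconds(path: str) -> float:
    # hyperfine --export-json writes {"results": [{"mean": <seconds>, ...}]}
    with open(path) as f:
        return json.load(f)["results"][0]["mean"]


# usage: python ratio.py benchjson/<baseline>_long.json benchjson/<new>_long.json
baseline, candidate = sys.argv[1], sys.argv[2]
print(f"{mean_seconds(baseline) / mean_seconds(candidate):.2f}x faster")
```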
Okay, with https://github.com/python/mypy/pull/18038 and the follow-up https://github.com/python/mypy/pull/18045, the gap between the clean and long environments is down to within noise.
See timings here: https://github.com/python/mypy/pull/18045#issuecomment-2438758007
So I think we can call this complete! :-)