Diagnostic results are inconsistent when using "--threads" option on sympy
The "mypy_primer" results for sympy consistently show differences between runs. This is likely due to circular dependencies within the sympy code, but an investigation should verify this hypothesis. If possible, it would be preferable to improve the repeatability even when type checking is parallelized using "--threads".
This also seems to occur with the numpy stubs in numpy/numtype. When I run it without --threads, I see
0 errors, 0 warnings, 0 notes
But running it again with --threads 2 gives
/projects/numtype/test/static/accept/arithmetic.pyi
/projects/numtype/test/static/accept/arithmetic.pyi:340:13 - error: "assert_type" mismatch: expected "timedelta64[None]" but received "timedelta64[timedelta | int | None]" (reportAssertTypeFailure)
1 error, 0 warnings, 0 notes
with --threads 3:
0 errors, 0 warnings, 0 notes
with --threads 4:
/projects/numtype/test/static/accept/arithmetic.pyi
/projects/numtype/test/static/accept/arithmetic.pyi:340:13 - error: "assert_type" mismatch: expected "timedelta64[None]" but received "timedelta64[timedelta | int | None]" (reportAssertTypeFailure)
1 error, 0 warnings, 0 notes
with 5, 6, 7, and 8 threads, there are no errors.
The error is reported at https://github.com/numpy/numtype/blob/13e4f4462bccf77f4a25f23b40d8c4c42665223d/test/static/accept/arithmetic.pyi#L340, and the relevant definition is at https://github.com/numpy/numtype/blob/13e4f4462bccf77f4a25f23b40d8c4c42665223d/src/numpy-stubs/__init__.pyi#L6524-L6536
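For anyone who wants to check another project for the same flakiness, here is a rough sketch of how the runs above could be compared automatically. It assumes `pyright` is on the PATH and that diagnostics look like the lines quoted above (containing ` - error: `). The helper names `run_pyright`, `error_lines`, and `unstable_diagnostics` are made up for this illustration.

```python
import subprocess
from typing import Dict, FrozenSet, Optional, Set


def run_pyright(project_dir: str, threads: Optional[int] = None) -> str:
    """Run the pyright CLI in project_dir and return its stdout.

    Assumes a `pyright` executable is available on PATH.
    """
    cmd = ["pyright"]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    result = subprocess.run(cmd, cwd=project_dir, capture_output=True, text=True)
    return result.stdout


def error_lines(output: str) -> FrozenSet[str]:
    """Keep only lines that report an error, dropping file headers and the summary."""
    return frozenset(
        line.strip() for line in output.splitlines() if " - error: " in line
    )


def unstable_diagnostics(runs: Dict[str, str]) -> Set[str]:
    """Return the errors that appear in at least one run but not in every run."""
    sets = [error_lines(out) for out in runs.values()]
    if not sets:
        return set()
    return set(frozenset.union(*sets) - frozenset.intersection(*sets))
```

Feeding it the output of `run_pyright(path)`, `run_pyright(path, 2)`, `run_pyright(path, 3)`, and so on should yield an empty set once the results are repeatable.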
This turns out to be an onion-peeling exercise. And like peeling a real onion, there may be some tears involved.
First, let me clearly state the goals of this exercise. There are three related goals:
- Ensure that multiple runs of pyright, even when running with the --threads option, emit a repeatable set of diagnostics. We expect to see the same number of diagnostics associated with the same locations in the same files.
- Ensure that the diagnostics are emitted in the same order each time.
- Ensure that the text of every diagnostic is identical across runs.
So far, I've achieved goal 2. This is done by simply sorting and de-duping the full list of diagnostics before they are emitted.
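As a minimal sketch of the idea (in Python for brevity, and not the actual pyright code), sorting and de-duping can be as simple as the following. The `Diagnostic` record and its fields are hypothetical stand-ins for whatever the checker really stores.

```python
from typing import Iterable, List, NamedTuple


class Diagnostic(NamedTuple):
    # Hypothetical record; field order doubles as the sort key.
    file: str
    line: int
    column: int
    rule: str
    message: str


def stable_diagnostics(diags: Iterable[Diagnostic]) -> List[Diagnostic]:
    """Drop exact duplicates, then sort by (file, line, column, rule, message)."""
    return sorted(set(diags))
```

Note that the message is part of the key, so two diagnostics that differ only in their text still count as different. Sorting fixes the order but does nothing for goal 3.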
Goal 1 is only partially achieved. I've found and fixed a couple of bugs that contributed to order-dependent type evaluations, but an indeterminate number still remain.
One issue that I've found but not yet fixed is related to a feature called call-site return type inference. This is an expensive feature, so it is limited to just three call levels deep, and results are aggressively cached. However, that means it can produce different results depending on the order in which certain calls are encountered. For example, suppose function a calls b, which calls c, which calls d, which calls e. If a is analyzed before b, the type of e will not be taken into account because it is more than three levels deep. If b is analyzed before a, then the call to e will be taken into account because it's just three levels deep, and the resulting return type of b will then be cached and used when a is eventually analyzed.

I haven't yet come up with a good way to retain this feature while also maintaining reasonable performance and guaranteeing deterministic (order-independent) results. One option is to turn off this feature entirely, but completion suggestions for Pylance users will then suffer. Another option is to retain completion suggestions but treat the return type as Unknown for purposes of type analysis (similar to what we do for ambiguous overload resolution). A third option is to greatly increase the depth limit so that it is unlikely to be hit, but this could create big performance issues (long hangs) in certain code bases like sympy and scipy.
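To make the order dependence concrete, here is a toy model in Python. It is not how pyright implements the feature. The call chain, the `infer` function, and the decision to cache every computed result (including truncated ones) are all simplifying assumptions for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical call chain: a calls b, b calls c, c calls d, d calls e.
CALLS: Dict[str, Optional[str]] = {"a": "b", "b": "c", "c": "d", "d": "e", "e": None}
MAX_DEPTH = 3  # how many call levels the toy inference is allowed to follow


def infer(name: str, cache: Dict[str, str], depth: int = 0) -> str:
    """Infer the return type of `name`, following at most MAX_DEPTH calls."""
    if name in cache:
        return cache[name]
    callee = CALLS[name]
    if callee is None:
        result = "int"      # leaf function with a known return type
    elif depth >= MAX_DEPTH:
        result = "Unknown"  # budget exhausted, give up on this call
    else:
        result = infer(callee, cache, depth + 1)
    cache[name] = result    # cached regardless of how it was obtained
    return result


def analyze(order: List[str]) -> Dict[str, str]:
    """Analyze the functions in the given order, sharing a single cache."""
    cache: Dict[str, str] = {}
    return {name: infer(name, cache) for name in order}
```

In this model, `analyze(["a", "b"])` reports `Unknown` for both functions, while `analyze(["b", "a"])` reports `int` for both, even though the code being analyzed is identical.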
Goal 3 is challenging because different analysis orderings can produce unions whose subtypes appear in different orders. This shouldn't affect the number or locations of diagnostics, but it can affect the text of the diagnostic message. For example, the types int | str and str | int are equivalent, but their text differs. One solution here is to always sort the subtypes for a deterministic ordering when printing the type. However, there's value in trying to retain union subtype order in most cases because the textual output will more closely match type annotations provided by the developer. For example, if the developer provides the annotation int | str, it may be confusing if pyright were to rewrite it as str | int in hover text, completion suggestions, signature help, and diagnostic messages. We could conditionally reorder only for diagnostic messages, but then there would be a visual difference between hover text and diagnostics, which is something that I think we want to avoid. Maybe goal 3 isn't as important as goals 1 and 2 for most pyright users, but it's still important if we want to have clean mypy_primer results during pyright development. I don't have a good solution to this one yet.
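The trade-off can be sketched in a few lines of Python. Representing subtypes as plain strings and the `print_union` helper with its `canonical` flag are assumptions made for this illustration, not pyright's type printer.

```python
from typing import Iterable


def print_union(subtypes: Iterable[str], canonical: bool = False) -> str:
    """Render a union; optionally sort subtypes so the text is order-independent."""
    unique = list(dict.fromkeys(subtypes))  # de-dupe while keeping first-seen order
    if canonical:
        unique.sort()
    return " | ".join(unique)


# Same union, discovered in two different orders.
run_1 = ["int", "str"]
run_2 = ["str", "int"]

# Preserving order mirrors whichever order the analysis happened to produce.
as_seen = (print_union(run_1), print_union(run_2))

# Canonical printing gives the same text for both runs, at the cost of
# possibly diverging from the order the developer wrote in the annotation.
stable = (print_union(run_1, canonical=True), print_union(run_2, canonical=True))
```

Here `as_seen` holds two different strings while `stable` holds the same string twice, which is exactly the tension between repeatable diagnostics and output that matches the source annotation.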
I'll continue to peel the onion, but I wanted to provide this update to record my interim findings and thoughts.
> One issue that I've found but not yet fixed is related to a feature called call-site return type inference. This is an expensive feature, so it is limited to just three call levels deep, and results are aggressively cached. However, that means it can produce different results depending on the order in which certain calls are encountered.
Looking forward to evolution on this thread as I think this might be the source of disagreements we're seeing between pyright errors reported on a single file (e.g. while editing) versus errors reported when looking at the whole codebase (e.g. during CI pre-merge tests) (https://github.com/microsoft/pyright/issues/9642#issuecomment-2573981647)