ruff icon indicating copy to clipboard operation
ruff copied to clipboard

Restore Salsa DB for exploring Salsa further

Open MichaReiser opened this issue 1 year ago • 1 comments
trafficstars

MichaReiser avatar May 08 '24 13:05 MichaReiser

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

github-actions[bot] avatar May 08 '24 14:05 github-actions[bot]

CodSpeed Performance Report

Merging #11338 will not alter performance

Comparing red-knot-salsa (06ee178) with main (2e0a975)

Summary

✅ 30 untouched benchmarks

codspeed-hq[bot] avatar May 28 '24 08:05 codspeed-hq[bot]

There's one limitation with the current model where the invalidation isn't as good as it could be and it is due to the fact that we build the entire symbol table at once (we don't have to and we could refactor that later).

Let's say we start with

# main.py
import foo;

x = foo.x

# foo.py
x = 10

def foo(): 
	pass

And we infer the type of x in main. To do this, the implementation runs

  • It parses main , builds its symbol table and cfg, calls into infer_module_body
  • It resolves foo when reaching import foo
  • It calls ModuleType.member when reaching foo.x. This fans out to parse foo, build its symbol table and control flow graph, and then runs module-level type inference for foo
  • ...

When we now change the content of foo to

x = 10

def foo(): 
	y = 10

What I expected is that the type inference for main wouldn't re-run because the module-level types of foo remain unchanged. However, that's not the case. The reason is that foo.x resolved the symbol table of foo.py and the symbol table has changed because we introduced y in the scope of foo. We have the same problem when a flag of an enclosing symbol changes. For example if the body of foo.foo is changed to y = x. The symbol table of that module changes because x now has the flag used.

We can avoid this by also building the symbol table per scope rather than once globally. Or have a query that reduces the global symbol table to just the global symbols. I do think something like that would be nice to have more fine granular invalidations.

MichaReiser avatar Jun 07 '24 19:06 MichaReiser

This is a great write-up, thanks for taking the time! A few thoughts from the write-up, before I dive into the code:

  1. One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function. (If a nested function uses nonlocal x and assigns to x, the symbol kind for x may change based on the knowledge that a nested function might assign to it.) So we can build symbol tables separately for module scope, class scope, and function scope -- but building a function symbol table needs to build all enclosed symbol tables (which may involve N nested class and function scopes, to arbitrary depth.) I think in practice this is still fine, though, and worth doing the split. Nesting isn't that common, and by definition the symbols inside a function scope aren't ones that another module can depend on, anyway.
  2. My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.
  3. I think if we are doing type inference per-scope rather than per-expression that will probably also recommend changes to how type inference works in the first place. We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

carljm avatar Jun 07 '24 21:06 carljm

One caveat with building symbol tables per-scope is that the contents of nested functions can affect the symbol table of the enclosing function.

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope. But we also have the option to build the entire symbol table at once and then split it per scope (similar to the trick with SymbolIndex, build it once, then have sub-queries that only return a slice of that.

My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

Yeah, but that would require that ModuleName becomes a sala ingredient. It might be fine but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, than what would you call the module thing that we pass to resolve_module? I might be overthinking this because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

My thinking why I called the Module { name: ModuleName } a module is that I don't think that the existence of the file on the disk makes a module. When we have import foo, then there's an import of the foo module, regardless if that module exists or not. That's why I think that a module's identity is really defined by its name.

We can probably just walk the AST for the scope, assigning types to expressions as we go, and similarly tracking narrowed local types for local symbols as we go. This will mean we don't even need FlowGraph anymore (a lot of the concepts developed in building it will still carry over, it will just be resolved eagerly instead.)

This is what the new implementation does. But I must say, it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

MichaReiser avatar Jun 08 '24 06:06 MichaReiser

The part that's unclear to me is how we would compute flags like USED when building the symbol table lazily per scope, because a symbol might be used in a child scope.

The USED flag only means "used in current scope", so it's not an issue. The only analysis that crosses scopes is the one Patrick was working on, and that's the case I discussed in my comment; but it only applies to nested scopes within function scopes, so handling a function scope and all it's nested scopes together is one way to handle it; module scopes and class scopes not nested in functions can be fully independent.

build it once, then have sub-queries that only return a slice of that.

This could work, too.

When we have import foo, then there's an import of the foo module, regardless if that module exists or not. That's why I think that a module's identity is really defined by its name.

This is a good point. I think it's a solid enough reason for the current naming. We will need to be able to track dependencies on nonexistent modules.

This is what the new implementation does.

"tracking narrowed local types for local symbols as we go" is the part that would specifically replace the CFG; I don't think the implementation here does that yet.

it would make me sad to see your CFG go away. I think it could be useful for other things than just typing, like an unreachable rule.

The eager version of the same logic would also have the ability to discover unreachable branches.

carljm avatar Jun 08 '24 16:06 carljm

Module Resolver

The module resolver remains mostly unchanged, although I did some renaming.

Module

My initial feeling is that what you are calling Module maybe should be called ModuleName, and what you are calling ResolvedModule maybe should be called Module. But I'm not totally sure until I look closer at the code.

Yeah, but that would require that ModuleName becomes a sala ingredient. It might be fine but it makes ModuleName a bit more awkward to use. But if we keep ModuleName a regular struct, than what would you call the module thing that we pass to resolve_module? I might be overthinking this because we can make the argument to resolve_module private and only expose a resolve_module(db: &dyn Db, name: &str) function that internally converts the &str to that Module thing and we can then keep our existing terminology. Maybe that's for the best (it also hides complexity).

I wonder if what's currently called Module could be renamed to ModuleRequest. The user "requests" a module (and the request is represented with a ModuleRequest instance) by importing a module with a certain ModuleName, but they might not actually get a module back, because the module might not actually exist. Unlike the Module type on the main branch, your Module type in crates/red_knot/src/salsa_db/semantic/module.rs doesn't really feel like a Module to me, as you can't query any information about the module directly from the type -- you have to resolve it first, and it feels like the module object is the thing you get given at the end of the resolution process.

AlexWaygood avatar Jun 10 '24 14:06 AlexWaygood

Thanks @AlexWaygood for the feedback. I don't plan to incorporate any of the code changes into this PR because I don't plan on merging. I'll incorporate your changes when working on the specific areas before pulling them into ruff.

MichaReiser avatar Jun 10 '24 14:06 MichaReiser