astsearch icon indicating copy to clipboard operation
astsearch copied to clipboard

Index ASTs

Open takluyver opened this issue 10 years ago • 4 comments

Without indexing, searching the IPython codebase takes a little under 4s (on my PC). Of this, about 0.8s is parsing, and 2.5s is walking the nodes. This is usable, but not ideal, and could be annoying for very large codebases.

This could be improved by indexing the ASTs. Querying a database for the root node type of the pattern should save having to iterate through many things that will never match.

This article has an interesting pattern for storing nested data structures in an SQL database: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/

takluyver avatar May 02 '14 00:05 takluyver

FWIF I used this with sympy and it didn't seem too slow to me.

asmeurer avatar May 02 '14 01:05 asmeurer

Thanks, that's useful to know. Sympy is quite a bit bigger than IPython, according to Ohloh. If you have a spare moment to time a simple search, I'd be interested in how long it does take.

takluyver avatar May 02 '14 01:05 takluyver

Searching ?/? took 27 seconds (by comparison git grep ./. sympy/ > /dev/null takes 0.3 seconds). Searching for ?.args took 14 seconds. Searching for ?(?,?) took 30 seconds.

What affects performance the most? Is it the number of ?s in the expression? The complexity of the expression? The number of positive matches?

asmeurer avatar May 02 '14 02:05 asmeurer

You've already done more careful measurements than me ;-). In my brief experiments with IPython, different simple queries all seemed to take about the same time. When I tried profiling it, walking the actual ASTs from the files was the biggest cost.

The first and simplest check is for the type of the root node in the pattern, e.g. ast.Attribute for ?.args. For every node that matches that, it will start the more complex checking implemented by astcheck, which effectively walks the pattern AST and compares each part of it to the sample. With ?.args, there's only one thing to check (sample.attr == 'args'). But for ?(?,?), I think it should end up with nothing to check at all, so presumably in that case it's the number of values being yielded that takes the time. I'm surprised that the difference is that big, though.

takluyver avatar May 02 '14 02:05 takluyver