Explore use of globset
Can globset speed up our file-matching code?
Quick question on the scope of this issue. I assume you're speaking about https://cs.github.com/charliermarsh/ruff/blob/1d13752eb1e33632c8f83b84004958b1c7c0c99d/src/fs.rs#L29, and more specifically about what happens when matching against a complex exclusion rule, is that correct?
In that case, I'm just dropping some food for thought here, in particular this GitHub answer by the author of `globset` on how it compares to `glob` (the lib currently used in ruff): https://cs.github.com/charliermarsh/ruff/blob/1d13752eb1e33632c8f83b84004958b1c7c0c99d/src/fs.rs#L29
I think what's important to notice here is that choosing between the two libs is a tradeoff. `globset` was optimized for the specific case of matching a single file against a large set of glob patterns: it compiles the set of globs into a set of regexes, moving from O(n), where n is the number of globs, to O(1), but it adds a fixed overhead for compiling those regexes compared to `glob`. However, it might make sense if we have a large list of files to filter, since we would only need to compile the regexes once for all of them.
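To make the tradeoff concrete, here is a minimal sketch of the regex-set idea in Python, using only the stdlib (`fnmatch.translate` turns a glob into a regex string). This is an analogy for what `globset` does internally, not ruff's actual code; `compile_glob_set` is a hypothetical name:

```python
import fnmatch
import re

def compile_glob_set(globs):
    """Combine many globs into one compiled regex (globset-style)."""
    # fnmatch.translate turns a glob like "*.py" into an anchored regex string.
    combined = "|".join(f"(?:{fnmatch.translate(g)})" for g in globs)
    return re.compile(combined)

globs = ["Lib/test/*.py", "Tools/i18n/pygettext.py", "*.pyc"]
matcher = compile_glob_set(globs)

# One regex test per path, instead of one match attempt per glob.
print(bool(matcher.match("Lib/test/test_grammar.py")))  # True
print(bool(matcher.match("Doc/conf.py")))               # False
```

The compilation cost is paid once up front; each subsequent path test no longer scales with the number of globs.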
The tables below roughly sum up my understanding of the performance of each crate.
| `glob` | Few globs in excludes | Lots of globs in excludes |
|---|---|---|
| Few files selected | efficient | inefficient, linear in the number of globs to match against |
| Lots of files selected | not sure, probably linear in the number of files | inefficient, quadratic in the two variables |
| `globset` | Few globs in excludes | Lots of globs in excludes |
|---|---|---|
| Few files selected | probably inefficient if a regex must be compiled, although the author implemented optimisations to avoid it | efficient thanks to regex |
| Lots of files selected | probably efficient, since regex compilation is offset by the number of files to match against | efficient, and regex compilation is offset as well |
I think a first step to this task is to define which case(s) out of the 4 defined in these tables we want to optimise for.
Regarding the second column, my personal understanding is that very few (big) projects have long and complex exclude lists with a lot of globs. (If you have some counter-examples in mind, please point me to them; I'd be curious to see them, and they could become good testing material for experimenting with `globset`. I had a glance at pydantic, which is the only large Python project using ruff that I'm aware of so far, and sadly I don't see an exclude list in their config 😢.)
If we want to optimize for the first column, there might be some wins as the list of files to evaluate gets longer, thanks to compiling the regexes only once. But we'll need figures to back this claim, and to identify the tipping point, in terms of number of files, at which `globset` becomes more efficient than `glob`. We also need to verify whether the author's optimisations to avoid compiling regexes are good enough to match `glob`'s performance with few files and few globs.
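As a rough way to feel out that tipping point before touching the Rust, a throwaway micro-benchmark of the two strategies can be sketched in Python (again a stdlib analogy, not the actual crates; the glob counts and file counts here are arbitrary):

```python
import fnmatch
import re
import time

globs = [f"Lib/test/test_{i}_*.py" for i in range(50)]            # many exclude patterns
paths = [f"Lib/test/test_{i % 50}_case.py" for i in range(5000)]  # many candidate files

# Strategy A (glob-style): try every pattern against every path.
start = time.perf_counter()
a = sum(any(fnmatch.fnmatch(p, g) for g in globs) for p in paths)
t_loop = time.perf_counter() - start

# Strategy B (globset-style): pay compilation once, then one test per path.
start = time.perf_counter()
matcher = re.compile("|".join(f"(?:{fnmatch.translate(g)})" for g in globs))
b = sum(bool(matcher.match(p)) for p in paths)
t_set = time.perf_counter() - start

assert a == b  # both strategies must agree on which paths are excluded
print(f"per-glob loop: {t_loop:.4f}s, combined regex: {t_set:.4f}s")
```

Sweeping the two list sizes up and down would show where the one-off compilation cost starts paying for itself, which is exactly the tipping point in question.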
That being said, I'm curious to see what experimentation with `globset` could bring. I guess clarifying expectations and devising a good benchmarking framework should be the first steps in tackling this issue. Anyway, that was my 2 cents on optimizing file matching 😄 Hope it helps whoever gives this a shot.
That's really helpful! You're right that this is something we need to evaluate empirically... (And yes, I was referring to the file matching in the `Complex` case.)
The way I'd typically benchmark this would be to clone `cpython`, then add this `pyproject.toml` file:
```toml
[tool.ruff]
line-length = 88
extend-exclude = [
    "Lib/lib2to3/tests/data/bom.py",
    "Lib/lib2to3/tests/data/crlf.py",
    "Lib/lib2to3/tests/data/different_encoding.py",
    "Lib/lib2to3/tests/data/false_encoding.py",
    "Lib/lib2to3/tests/data/py2_test_grammar.py",
    "Lib/test/bad_coding2.py",
    "Lib/test/badsyntax_3131.py",
    "Lib/test/badsyntax_pep3120.py",
    "Lib/test/encoded_modules/module_iso_8859_1.py",
    "Lib/test/encoded_modules/module_koi8_r.py",
    "Lib/test/test_fstring.py",
    "Lib/test/test_grammar.py",
    "Lib/test/test_importlib/test_util.py",
    "Lib/test/test_named_expressions.py",
    "Lib/test/test_patma.py",
    "Lib/test/test_source_encoding.py",
    "Tools/c-analyzer/c_parser/parser/_delim.py",
    "Tools/i18n/pygettext.py",
    "Tools/test2to3/maintest.py",
    "Tools/test2to3/setup.py",
    "Tools/test2to3/test/test_foo.py",
    "Tools/test2to3/test2to3/hello.py",
]
```
(Or maybe add some `*` or other wildcard patterns instead of all these fixed patterns.) That's actually the configuration we currently use for benchmarking all of Ruff against CPython.
Then I'd change `src/linter.rs#lint_path` to just return `Ok(vec![])` or something, so that we're mostly benchmarking the file collection.
Then I'd run something like:
```shell
cargo build --release && hyperfine --warmup 10 --runs 50 -i "./target/release/ruff resources/test/cpython --no-cache"
```
...and do that for a few different combinations to get a sense for what makes a difference here.
From what I could tell looking into this, it's looking hopeful. Simply dropping in `globset::GlobMatcher` in place of `glob::Pattern` slowed things down (see img 1). But after refactoring, basically gutting `FilePattern` and having `fs::is_excluded` take one `globset::GlobSet` instead of an `Iterator<Item = FilePattern>`, there are performance gains to be had (img 2).
My code is still rather messy (a lot of `.expect`s, because `globset` really likes its `Option`s), but I'll take some time to clean it up and hand it over to y'all for review.
IMG 1 (drop in)
IMG 2 (refactor)
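The shape of that refactor can be sketched as follows, in Python standing in for the Rust (the function names mirror the real ones, but the bodies are a stdlib analogy, not the actual implementation): compile the whole exclude list once at startup, then pass the single pre-built matcher into the exclusion check, rather than looping over a list of patterns for every file.

```python
import fnmatch
import re

def build_exclude_set(globs):
    """Compile the whole exclude list once, up front (like one GlobSet)."""
    combined = "|".join(f"(?:{fnmatch.translate(g)})" for g in globs)
    return re.compile(combined)

def is_excluded(path, exclude_set):
    """One matcher test per path; no per-pattern loop, no recompilation."""
    return exclude_set.match(path) is not None

# Built once, reused across every file the traversal visits.
exclude_set = build_exclude_set(["Lib/lib2to3/tests/data/*.py", "Tools/test2to3/*"])

print(is_excluded("Lib/lib2to3/tests/data/bom.py", exclude_set))  # True
print(is_excluded("Lib/test/test_patma.py", exclude_set))         # False
```

The key design change is in the signature: `is_excluded` receives one pre-compiled set instead of an iterator of patterns, which is what moves the compilation cost out of the per-file hot path.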
@CelebrateVC - Awesome! Happy to review whenever it's ready for a look.
Done thanks to @CelebrateVC!