Explore use of globset
Can globset speed up our file-matching code?
Quick question on the scope of this issue. I assume you're speaking about https://cs.github.com/charliermarsh/ruff/blob/1d13752eb1e33632c8f83b84004958b1c7c0c99d/src/fs.rs#L29, and more specifically about what happens when matching against a complex exclusion rule, is that correct?
In that case, I'm just dropping some food for thought here, in particular this GitHub answer by the author of `globset` on how it compares to `glob` (the lib currently used in ruff): https://cs.github.com/charliermarsh/ruff/blob/1d13752eb1e33632c8f83b84004958b1c7c0c99d/src/fs.rs#L29
I think what's important to notice here is that choosing between the two libs is a tradeoff. `globset` was optimized for the specific case of matching a single file against a large set of glob patterns: it compiles the set of globs into a set of regexes, moving from O(n), where n is the number of globs, to O(1), but it adds a fixed overhead for compiling those regexes compared to `glob`. However, it might make sense if we have a large list of files to filter, since we would only need to compile the regexes once for all of them.
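To make the tradeoff concrete, here is a minimal sketch of the regex-set idea in Python, using only the stdlib (`fnmatch.translate` turns a glob into a regex string). This is an analogy for what `globset` does internally, not ruff's actual code; `compile_glob_set` is a hypothetical name:

```python
import fnmatch
import re

def compile_glob_set(globs):
    """Combine many globs into one compiled regex (globset-style)."""
    # fnmatch.translate turns a glob like "*.py" into an anchored regex string.
    combined = "|".join(f"(?:{fnmatch.translate(g)})" for g in globs)
    return re.compile(combined)

globs = ["Lib/test/*.py", "Tools/i18n/pygettext.py", "*.pyc"]
matcher = compile_glob_set(globs)

# One regex test per path, instead of one match attempt per glob.
print(bool(matcher.match("Lib/test/test_grammar.py")))  # True
print(bool(matcher.match("Doc/conf.py")))               # False
```

The compilation cost is paid once up front; each subsequent path test no longer scales with the number of globs.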
The tables below roughly sum up my understanding of the performance of each crate.
| `glob` | Few globs in excludes | Lots of globs in excludes |
|---|---|---|
| Few files selected | efficient | inefficient, linear in the number of globs to match against |
| Lots of files selected | not sure, probably linear in the number of files | inefficient, quadratic in the two variables |
| `globset` | Few globs in excludes | Lots of globs in excludes |
|---|---|---|
| Few files selected | probably inefficient if a regex must be compiled, although the author implemented optimisations to avoid it | efficient thanks to regex |
| Lots of files selected | probably efficient, since regex compilation is offset by the number of files to match against | efficient, and regex compilation is offset as well |
I think a first step to this task is to define which case(s) out of the 4 defined in these tables we want to optimise for.
Regarding the second column, my personal understanding is that very few (big) projects have long and complex exclude lists with a lot of globs. (If you have some counter-examples in mind, please point me to them; I'd be curious to see them, and they could become good testing material for experimenting with `globset`. I had a glance at pydantic, which is the only large Python project using ruff that I'm aware of so far, and sadly I don't see an exclude list in their config 😢.)
If we want to optimize for the first column, there might be some wins as the list of files to evaluate gets longer, thanks to compiling the regexes only once. But we'll need figures to back this claim, and to identify the tipping point, in terms of number of files, at which `globset` becomes more efficient than `glob`. We also need to verify whether the author's optimisations to avoid compiling regexes are good enough to match `glob`'s performance with few files and few globs.
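As a rough way to feel out that tipping point before touching the Rust, a throwaway micro-benchmark of the two strategies can be sketched in Python (again a stdlib analogy, not the actual crates; the glob counts and file counts here are arbitrary):

```python
import fnmatch
import re
import time

globs = [f"Lib/test/test_{i}_*.py" for i in range(50)]            # many exclude patterns
paths = [f"Lib/test/test_{i % 50}_case.py" for i in range(5000)]  # many candidate files

# Strategy A (glob-style): try every pattern against every path.
start = time.perf_counter()
a = sum(any(fnmatch.fnmatch(p, g) for g in globs) for p in paths)
t_loop = time.perf_counter() - start

# Strategy B (globset-style): pay compilation once, then one test per path.
start = time.perf_counter()
matcher = re.compile("|".join(f"(?:{fnmatch.translate(g)})" for g in globs))
b = sum(bool(matcher.match(p)) for p in paths)
t_set = time.perf_counter() - start

assert a == b  # both strategies must agree on which paths are excluded
print(f"per-glob loop: {t_loop:.4f}s, combined regex: {t_set:.4f}s")
```

Sweeping the two list sizes up and down would show where the one-off compilation cost starts paying for itself, which is exactly the tipping point in question.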
That being said, I'm curious to see what experimentation with `globset` could bring. I guess clarifying expectations and devising a good benchmarking framework should be the first steps in tackling this issue. Anyway, that was my 2 cents on optimizing file matching 😄 Hope it helps whoever gives this a shot.
That's really helpful! You're right that this is something we need to evaluate empirically... (And yes, I was referring to the file matching in the `Complex` case.)
The way I'd typically benchmark this would be to clone `cpython`, then add this `pyproject.toml` file:
```toml
[tool.ruff]
line-length = 88
extend-exclude = [
    "Lib/lib2to3/tests/data/bom.py",
    "Lib/lib2to3/tests/data/crlf.py",
    "Lib/lib2to3/tests/data/different_encoding.py",
    "Lib/lib2to3/tests/data/false_encoding.py",
    "Lib/lib2to3/tests/data/py2_test_grammar.py",
    "Lib/test/bad_coding2.py",
    "Lib/test/badsyntax_3131.py",
    "Lib/test/badsyntax_pep3120.py",
    "Lib/test/encoded_modules/module_iso_8859_1.py",
    "Lib/test/encoded_modules/module_koi8_r.py",
    "Lib/test/test_fstring.py",
    "Lib/test/test_grammar.py",
    "Lib/test/test_importlib/test_util.py",
    "Lib/test/test_named_expressions.py",
    "Lib/test/test_patma.py",
    "Lib/test/test_source_encoding.py",
    "Tools/c-analyzer/c_parser/parser/_delim.py",
    "Tools/i18n/pygettext.py",
    "Tools/test2to3/maintest.py",
    "Tools/test2to3/setup.py",
    "Tools/test2to3/test/test_foo.py",
    "Tools/test2to3/test2to3/hello.py",
]
```
(Or maybe add some `*` or other wildcard patterns instead of all these fixed patterns.) That's actually the configuration we currently use for benchmarking all of Ruff against CPython.
Then I'd change `src/linter.rs#lint_path` to just return `Ok(vec![])` or something, so that we're mostly benchmarking the file collection.
Then I'd run something like:
```shell
cargo build --release && hyperfine --warmup 10 --runs 50 -i "./target/release/ruff resources/test/cpython --no-cache"
```
...and do that for a few different combinations to get a sense for what makes a difference here.
From what I could tell looking into this, it's looking hopeful. Simply dropping in `globset::GlobMatcher` in place of `glob::Pattern` slowed things down (see img 1). But after refactoring, basically gutting `FilePattern` and having `fs::is_excluded` take one `globset::GlobSet` instead of an `Iterator<Item = FilePattern>`, there are performance gains to be had (img 2).
My code is still rather messy (a lot of `.expect`s, because `globset` really likes its `Option`s), but I'll take some time to clean it up and hand it over to y'all for review.
IMG 1 (drop in)
IMG 2 (refactor)
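The shape of that refactor can be sketched as follows, in Python standing in for the Rust (the function names mirror the real ones, but the bodies are a stdlib analogy, not the actual implementation): compile the whole exclude list once at startup, then pass the single pre-built matcher into the exclusion check, rather than looping over a list of patterns for every file.

```python
import fnmatch
import re

def build_exclude_set(globs):
    """Compile the whole exclude list once, up front (like one GlobSet)."""
    combined = "|".join(f"(?:{fnmatch.translate(g)})" for g in globs)
    return re.compile(combined)

def is_excluded(path, exclude_set):
    """One matcher test per path; no per-pattern loop, no recompilation."""
    return exclude_set.match(path) is not None

# Built once, reused across every file the traversal visits.
exclude_set = build_exclude_set(["Lib/lib2to3/tests/data/*.py", "Tools/test2to3/*"])

print(is_excluded("Lib/lib2to3/tests/data/bom.py", exclude_set))  # True
print(is_excluded("Lib/test/test_patma.py", exclude_set))         # False
```

The key design change is in the signature: `is_excluded` receives one pre-compiled set instead of an iterator of patterns, which is what moves the compilation cost out of the per-file hot path.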
@CelebrateVC - Awesome! Happy to review whenever it's ready for a look.
Done thanks to @CelebrateVC!