GH-6114: Static path matching
This (early stage) WIP PR introduces a static path matcher which intends to emulate the behavior of the PHPUnit FileIterator in order to prevent PHPUnit traversing the filesystem when a deprecation is triggered.
The PHPUnit FileIterator uses glob to find directories and we therefore need to support the glob patterns - which can vary according to the platform. This PR uses https://man7.org/linux/man-pages/man7/glob.7.html as a reference in addition to testing the behavior locally to confirm assumptions.
The webmozart/glob provides a similar feature however it's behavior is different as it supports curly braces, and * is restricted to a single directory level, while * in PHPUnit will return all descendants and I'm sure there are other differences - however I've used that as a starting point.
TODO:
- [x] Escaping unhandled regex characters, or escaping the regex before parsing the pattern.
- [x] Character classes
[:alnum:]etc. - [ ] Collating symbols
- [ ] Equivalence class expressions
- [x] Emulating
globbehavior of unterminated[character groups. - [ ] Seems to be a PHPUnit globstar bug whereby
/a**will match/band all other directories, where as/ab*will not match anything. We can either copy that behavior or "fix" it.
~and maybe writing the implementation from scratch if regex turns out to be a bad fit.~
Usages on Github:
- wildcard: 1.8k https://github.com/search?q=directory+%22*%22+path%3A*.xml+path%3A**%2Fphpunit.xml+language%3AXML&type=code
- globstar: 378: https://github.com/search?q=directory+%22**%22+path%3A*.xml+path%3A**%2Fphpunit.xml+language%3AXML&type=code
- Character range/class (
directory.*\[): 0: https://github.com/search?q=%2Fdirectory.%5C%5B%2F+path%3A.xml+path%3A**%2Fphpunit.xml+language%3AXML&type=code ?any char: 0 https://github.com/search?q=%2Fdirectory.%5C%5B%2F+path%3A.xml+path%3A**%2Fphpunit.xml+language%3AXML&type=code
Have refactored to tokenize the glob string, while less performant it's easier to reason about and we only need to compile the regex for each <include or <exclude - not for each path.
@sebastianbergmann this PR is indicative of the approach but needs the final 20% of work:
- Refactor the source filter/mapper tests to create the paths they need in a temporary directory? this would make it easier to test edge cases without contaminating other tests and grant mode confidence.
- Implement the exclude/include logic (either by compiling that into the regex or as a separate concern).
- We can implement or forbid the use of
[[:character-ranges:]]and other "esoteric" glob patterns.
But I'm keen for any thought you have on the approach so far - the biggest concern for me is providing confidence that it will be compatible with at least 99.99% of configuration files.
@sebastianbergmann this PR is indicative of the approach but needs the final 20% of work:
* Refactor the source filter/mapper tests to create the paths they need in a temporary directory? this would make it easier to test edge cases without contaminating other tests and grant mode confidence. * Implement the exclude/include logic (either by compiling that into the regex or as a separate concern). * We can implement or forbid the use of `[[:character-ranges:]]` and other "esoteric" glob patterns.
Thank you for your work so far. I am afraid that I might not be able to look into this before the PHPUnit Code Sprint in April in Berlin. There I will be able to look into this, for sure, and together with other developers. I hope I can look into this before then, but I cannot promise that.
But I'm keen for any thought you have on the approach so far - the biggest concern for me is providing confidence that it will be compatible with at least 99.99% of configuration files.
We could delay this until PHPUnit 13 and break backward compatibility (for edge cases, "esoteric" glob patterns, etc.).
Is it expected that the tests are currently failing?
Yes - the prefix/suffix filtering hasn't been added yet and the tests fail due to the tests we added upstream. The PR is at an inflection point:
- Do we want to commit to supprting all the glob features? or could we simplify it.
- Do we want to refactor the source-finding/filtering tests to be more dynamic?[1]
- Are we happy with the general approach in this PR?
And then I was also considering how to implement the suffix/prefix logic - as it probably makes sense to include them in the compiled regex - but then it intrudes on the "copy the behavior of glob" approach" - so if we decide to simplify the globbing behavior it makes that decision easier too.
[1] currently all the source mapping/filtering tests share the same fixture which makes it hard to test edge cases without breaking other tests - I think it could generate the file tree instead.
I discussed this today with @edorian, @theseer, and @sebastianheuer at the PHPUnit Code Sprint. Even for more complex <source> configurations, we were unable to measure a runtime of more than 150 ms for the current implementation of SourceMapper::map(). We do not think that a potential reduction in runtime would justify a more complicated implementation.
Can you share how much time is spent in SourceMapper::map() in your environment and what the <source> section of your configuration looks like?
Hi @sebastianbergmann - this is the repoducer repository - on my local envrionment it pauses for around 3-4 seconds while loading the source map.
Hi @sebastianbergmann - this is the repoducer repository - on my local envrionment it pauses for around 3-4 seconds while loading the source map.
Thank you. How much does the time spent go down with your new implementation?
filtering without file traversal basically removes the performance issue (the performance is similar to not having a source-map at all). the only issue really is replicating the glob behavior - or deliberately breaking compatiblity and simplifying the behavior to some extent.
Thank you. I have to say, though, that I do not think that 3-4 seconds is too bad considering there are 120K files in the reproducer. I am now less convinced than I originally was that this is an issue worth fixing, sorry.
On the project for which I reported this issue it takes 1 minute and 22 seconds. We've had to remove the source mapping entintirely in order to be able to use PHPUnit. I will endeavour to update the reproduction repository to replicate that.
I thought I'd share that this is also a problem for others who are running PHPUnit on relatively large codebases. We have ~9.2k files and executing this loop takes ~8 seconds on my machine the first time E_USER_DEPRECATED is emitted. It's not the biggest issue in the world for us and the workaround is to fix all deprecation warnings (easier said than done) but addressing this would really improve our quality of life. Thank you for your consideration!
@sebastianbergmann and I had a call and decided to simplify the approach in the PR to support the bare minumum glob syntax and that it would be part of the next major release and merged around December time.
Once we have the new implementation, we could introduce it into the current stable version in addition to the old implemention. When old and new implementation "disagree" we could then emit a deprecation warning.
@dantleech Did you have a chance to work on this yet? Thanks!
hi, no not yet, need to find some time
@dantleech FYI: main is not PHPUnit 13.0-dev (where these changes should land). This hopefully makes working on this easier/simpler/more convenient.
would you be option to using a third-party package, namely webmozarts/glob? We could evaluate it with that implementation at least I guess.
Sure, we can evaluate @webmozart's package. I am usually reluctant to add a new dependency (that I did not create / do not maintain myself), but if this package provides what we need and is maintained on the PHP versions we need then I do not see a problem.
Replaced by #6398