Discovery cache improvements
Let's look into whether we can improve the discovery cache for local development:
- Partial discovery cache: only discover files that have been changed or added according to local git history
- Always cache vendor discovery: vendor files can only be changed on composer update/install. Let's add a post-install script that will re-store discovery cache, but let's not re-scan vendor files when not needed
- Have a discovery cache manager so that paths don't have to be hard-coded in each discovery class
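To make the first bullet concrete, here's a minimal sketch of how a "partial discovery" file list could be derived from git: tracked files modified since the last commit, plus new untracked files. This is a hypothetical illustration, not Tempest's implementation; the throwaway repo below exists only to make the demo self-contained.

```shell
#!/bin/sh
# Hypothetical sketch: list the files a partial discovery pass would re-scan.
# The temporary repo is only here to make the demo runnable anywhere.
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"
printf '<?php class Alpha {}\n' > Alpha.php
git add Alpha.php
git commit -qm 'initial'

# Simulate local development: one edited file, one brand-new file.
printf '<?php class Alpha { public int $x = 1; }\n' > Alpha.php
printf '<?php class Beta {}\n' > Beta.php

# Candidate list for partial re-discovery:
git diff --name-only HEAD                 # changed tracked files
git ls-files --others --exclude-standard  # new untracked files
```

Everything else (unchanged tracked files, and all of `vendor/`, per the second bullet) could then be served straight from the cache.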
@brendt - I have some thoughts here...
While less developer friendly, have you tried parsing files with nikic/php-parser at all? I'd imagine it'd be faster, but unsure. We'd also avoid some runtime issues we have today (e.g., if you run discovery across all files in the vendor dir, you might load a Symfony class relying on a dependency not installed, etc.).
I'm wondering if we could build a few helpers over top with some basic AST checks (e.g., Attributes, whether a class extends another, whether a class implements another, methods, parameters, etc.).
We might be able to also use this method to pull caching out of the hands of individual discoverers and only handle it at a discovery manager level.
> While less developer friendly
😱
> have you tried parsing files with nikic/php-parser at all? I'd imagine it'd be faster, but unsure
Why? Compared to reflection? Noooo definitely not, reflection is super fast
> if you run discovery across all files in the vendor dir, you might load a Symfony class relying on a dependency not installed
We shouldn't though? We're only scanning Tempest packages. On top of that, we're already catching runtime exceptions and are simply not loading those classes
> I'm wondering if we could build a few helpers over top with some basic AST checks (e.g., Attributes, whether a class extends another, whether a class implements another, methods, parameters, etc.).
We have this with the reflector helpers — I find them really useful, don't you?
> We might be able to also use this method to pull caching out of the hands of individual discoverers and only handle it at a discovery manager level.
That is the goal, but I don't see why we'd need to switch reflection for it?
> We shouldn't though? We're only scanning Tempest packages. On top of that, we're already catching runtime exceptions and are simply not loading those classes
But what exactly is a Tempest package? Can I make my own Tempest package which also gets autodiscovered?
@Wulfheart every package that has a dependency on a Tempest core package is considered a Tempest package. Right now you'd need to require e.g. `tempest/core` and your package will be discovered. However, I plan on adding some kind of "package tools package" that all third-party packages should require.
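In other words, a third-party package would opt into discovery simply by declaring that dependency. A sketch of such a package's `composer.json` (the package name and version constraint are made up for illustration):

```json
{
    "name": "acme/tempest-widgets",
    "description": "Hypothetical third-party Tempest package",
    "require": {
        "tempest/core": "^1.0"
    }
}
```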
Store discovery cache as plain PHP instead of manually serializing it
I played around with several cache pool adapters; the fastest one is actually serializing the payload and storing it as a file. E.g., `PhpFileCacheAdapter` is slower. There might be room for much more improvement, but I'm going to skip this refactor for now. I still plan on looking into partial caching though.
Do you have benchmarks of Tempest with and without the cache? I'm curious as to how Discovery impacts performance
Taking note of some benchmarks, this is `time ./tempest`:
| branch | cache enabled | cache disabled | partial cache enabled |
|---|---|---|---|
| framework (main) | 0.12s | 0.16s | / |
| framework (partial) | 0.11s ✅ | 0.21s ❗️ | 0.20s ❗️ |
| project (main) | 0.11s | 0.15s | / |
| project (partial) | 0.11s ✅ | 0.15s ✅ | 0.10s ✅ |
| big project (main) | 0.42s | 0.74s | / |
| big project (partial) | 0.36s ✅ | 0.71s ✅ | 0.58s ✅ |
Note that the framework code will benefit less, or not at all, from these changes, since there are no vendor files to discover in the monorepo.
The main performance bottleneck comes from the following: if we want to allow partial discovery, we need to introduce extra checks to determine whether a specific discovery location should be crawled for a specific discovery class. E.g., if partial cache is enabled, all discovery locations in vendor/ should be skipped, but ONLY if there is a valid cache entry for each of them. If that's not the case, we have to warm the cache the first time. This per-location check adds overhead, since we now need to check cache status for every location/discovery combination.
BTW, this is the branch I'm working on: https://github.com/tempestphp/tempest-framework/tree/partial-discovery-cache
Update: if we manually generate the discovery cache with the `discovery:generate` command (run it on deploy and after composer install), we won't have to worry about doing extra checks. The upside is that partial discovery does indeed have a performance gain in that scenario. The downside is that people will need to remember to run `discovery:generate`. We can automate it as much as possible, and we can add error messages to let people know they forgot to run it in some cases.
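For reference, the "run it after composer install" part could be wired up via Composer's script events. A sketch of the relevant `composer.json` fragment (the exact binary name is an assumption based on the `discovery:generate` command mentioned above):

```json
{
    "scripts": {
        "post-install-cmd": [
            "php tempest discovery:generate"
        ],
        "post-update-cmd": [
            "php tempest discovery:generate"
        ]
    }
}
```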
Benchmarks for running our test suite
| branch | test suite |
|---|---|
| main | 00:16 |
| partial-discovery-cache | 00:15 ✅ |
TODO after merge:
- [x] Document partial discovery cache
- [x] Composer hook to automatically generate discovery cache
- [x] Testing, but it should be done outside our normal test suite (since clearing caches affects the whole test suite). I think the best approach is to move discovery-related cache tests to the `tempest/app` test suite
> Taking note of some benchmarks, this is `time ./tempest`
>
> | branch | cache enabled | cache disabled | partial cache enabled |
> |---|---|---|---|
> | framework (main) | 0.12s | 0.16s | / |
> | framework (partial) | 0.11s ✅ | 0.21s ❗️ | 0.20s ❗️ |
> | project (main) | 0.11s | 0.15s | / |
> | project (partial) | 0.11s ✅ | 0.15s ✅ | 0.10s ✅ |
> | big project (main) | 0.42s | 0.74s | / |
> | big project (partial) | 0.36s ✅ | 0.71s ✅ | 0.58s ✅ |
To make some sense of those numbers, I would like to know approximately how many files and classes are in the big project, and what the machine specs are that ran `time ./tempest`.
Looking at those numbers, I am a bit worried about how it performed. Taking around a minute to warm the cache is pretty significant.
> Taking around a minute to warm the cache is pretty significant
It didn't. 1 minute = 60 seconds. The longest number here is 0.74s which is 740 milliseconds or 0.0123 minutes.
I still think we can get that down considerably.
> …are in the big project and what the machine specs are that ran `time ./tempest`.
These benchmarks shouldn't be used to compare absolute numbers, but I can give some more info:
- The big project benchmark counted 2000 generated vendor files and 4000 generated source files. We use https://github.com/tempestphp/tempest-benchmark/blob/main/setup/setup.php to generate such projects (although this version doesn't include the vendor files; that's a local change I still have to push)
- It was run on a Mac Mini M1
- As Aidan pointed out: half a second, not 1 minute :)
Oh, sorry. My bad. I guess that's not to warm the cache then, and it is just to pass through the checks after the cache is already there.
But that's latency added to every request. Just to put it out there: the welcome page for Symfony in a dev env is 42ms on my machine (less powerful than a Mac Mini M1).
> I guess that's not to warm the cache then and it is just to pass through the checks after the cache is already there.
I'd suggest taking another look at the table above. That is with cache disabled.
I guess something else is going on then, because by that table it would be 220ms of overhead just for checking the partial caching, wouldn't it?