dependency-cruiser
Question: Parallelization of AST Parsing (Plus a few other things)
Summary
Have a few questions
Context
First of all, I want to start by saying that dependency-cruiser is a great tool. We've been using it for a while at BrexHQ, and it has made a few things a lot easier. So thanks for all the work you have done on it.
We mainly use it to enforce architecture patterns across our codebase, structure migrations, and detect which sections of our codebase have the most "violations" in order to prioritize work. (Side-note: it's really helpful for structuring long, gradual migrations 🔥)
I do have a question (/suggestion/request for pointers). Right now, for our main frontend codebase (which is 100% TypeScript), the full dependency analysis takes around 2 minutes to finish. We then run a few HTML reports over it on CI and publish the artifacts in a few ways.
2 minutes is not a long time in the grand scheme of things. However, in our specific case we are projecting a lot of growth for that codebase, and we are also looking into a mono-repo structure for a lot of our JS code (which... yeah, that'll be huge 😅).
So, enough context; on to my questions. I've been looking into the AST parsing sections of this codebase and had a couple of questions:
- The change from swc to tsc: I'm wondering why it happened? I saw a comment about it, but what were the features you did not see in swc that made you default to tsc? Is this something you are thinking of revisiting after a while?
- (I haven't done any measuring, so I'm only assuming.) Am I correct in thinking the AST parsing sections of the codebase are the slowest ones? If so, have you looked into parallelizing them?
  - My thought process was that a mixture of workers and shared-memory arrays could help avoid the serialization.
  - Maybe for TS you need the whole project to be loaded in memory first? (Unsure if that's what `typescript.createSourceFile` is doing 🤔)
  - I guess the same question goes for the other, non-tsc extractors.
- Have you thought about supporting a templating language for a rule's `comment` field?
  - This is mainly to improve DX for engineers who see reports but have little-to-no idea of dependency-cruiser's configuration. For example, I have seen myself doing some of the following (see the usage of `allowedPaths`/`rulePathText`):

```javascript
const restrictedImportFactory = (path, ruleId, allowedPaths) => {
  // We allow certain imports for this file
  const allAllowedPaths = [
    ...allowedPaths,
    "src/__generated__/globalTypes.ts",
  ];
  const rulePathText = allAllowedPaths
    .map((rulePath) => ` - ${rulePath}\n`)
    .join("");
  const id = slugify(ruleId);
  return {
    // Prevent a directory from reaching into other domains
    name: `module-has-restricted-imports---${id}`,
    comment: `"${path}" is defined to have specifically restricted imports, that means it should NOT import code from any other file than specifically allowed files. Allowed imports: ${rulePathText}`,
    severity,
    from: { path: [path] },
    to: {
      path: "src/.*",
      pathNot: allAllowedPaths,
      preCompilationOnly: shouldConsiderOnlyTypes,
    },
  };
};
```
- For example, something like:

```
comment: `
  Allowed Rules:
  {{#each rule.to.pathNot as |path pathId| }}
    - {{ path }}
  {{/each}}
`
```
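For what it's worth, a minimal sketch of how such `{{#each}}` expansion could work under the hood. This is hand-rolled purely for illustration; a real implementation would more likely use an existing engine like Handlebars, and the `path` placeholder name and the `rule` context shape are assumptions:

```javascript
// Hand-rolled sketch of {{#each}} expansion over a rule comment. A real
// implementation would more likely use an engine like Handlebars; the
// "path" placeholder name and the rule context shape are assumptions.
function renderComment(template, context) {
  return template.replace(
    /\{\{#each\s+([\w.]+)[^}]*\}\}([\s\S]*?)\{\{\/each\}\}/g,
    (_match, arrayPath, body) => {
      // look up e.g. "rule.to.pathNot" in the context object
      const items =
        arrayPath.split(".").reduce((obj, key) => (obj || {})[key], context) ||
        [];
      return items
        .map((item) => body.replace(/\{\{\s*path\s*\}\}/g, item))
        .join("");
    }
  );
}

const rule = { to: { pathNot: ["src/a.ts", "src/b.ts"] } };
const comment = renderComment(
  "Allowed:\n{{#each rule.to.pathNot }} - {{ path }}\n{{/each}}",
  { rule }
);
// comment is now "Allowed:\n - src/a.ts\n - src/b.ts\n"
```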
Related to this, I had some other thoughts/questions (maybe I should put them in another thread to better discuss? LMK).
One thing I find myself often reaching for is a way to analyze what a module I'm depending on is exporting. Like, if we could expose import-specific information, we would be able to do more granular rules 🤔 (Started thinking of it after #588)
For example, let's say we have a file `performanceHelpers.ts` that exports maaaany things (a bunch of types, a default export, several named exports) and, amongst all of them, a function `measureControllerPerformance`.
If I want to enforce that `fileThatNeedsMeasurement.ts` imports `performanceHelpers.ts`, I can do it... However, I cannot enforce that it imports `measureControllerPerformance` specifically.
(We can argue that we cannot enforce `measureControllerPerformance` is actually being used... so what's the point 😆 but it gives us users a bit more control over their rules.)
We could also expose a module's "export information" and be able to create rules that enforce that an `xxxController.tsx`, for example, has to have a specifically named export, etc.
Sorry for the wall of text 😅 I had a few thoughts after using this tool for a bit and wanted to pick your brain; maybe you have already thought of all of these and decided for or against them, or not.
I've been getting familiar with the code, and I'd be more than happy to either help with any of this or, tbh, just discuss it. I didn't want to come across as the "fix your open-source stuff for me" kind of reporter 😅
Either way, thanks a lot for the work you've done here 🙏
Environment
- Version used:
- Node version:
- Operating System and version:
- Link to your project:
Hi @fforres thanks for taking the time for that write-up! Learning how dependency-cruiser is used is very interesting to me - and helps a lot in making the tool better. I've tried to respond to the questions you raise one by one, but it'll be in several instalments :-)
swc vs tsc
Dependency-cruiser still supports using swc as a parser (see the parser paragraph at the bottom of the options reference if you’d like to try it. It does require swc to be installed in the same spot as dependency-cruiser, though).
From (almost) the beginning dependency-cruiser has used tsc. I added (experimental) support for swc in the hope it would make running dependency-cruiser faster. It did - but not as dramatically as I’d hoped (my benchmarks indicated a ~15% performance improvement in the step it was run in, and 6% for dependency-cruiser overall).
swc is not at a 1.x.x version yet. Features not supported last time I checked:
- triple slash directives
- decorators (heavily used in my day-job’s code base)
- it doesn’t process some .d.ts files correctly (see "const in ambient context incorrectly flagged as uninitialised"; the last comment from the author in that issue seems to indicate this is intentional).
(Also see the original PR: https://github.com/sverweij/dependency-cruiser/pull/451)
The reason I stopped using it as the compiler to check dependency-cruiser on itself is that dependency-cruiser has .d.ts files which swc doesn’t process well.
Speed - where does dependency-cruiser spend most of its time?
It used to be other things (not lazy-loading node modules on startup; a subtle performance problem in enhanced-resolve), but these days it’s “visiting dependencies”, which consists of, among other things, I/O (reading files), parsing, and resolving dependencies to disk.
This is the breakdown for running through the react codebase: 1540 modules, 4724 dependencies (dependency-cruiser emits this when passed `--progress performance-log`):
```
elapsed heapTotal heapUsed after step...
  824ms      82Mb     55Mb start of node process
    8ms      82Mb     56Mb parsing options
  156ms      91Mb     60Mb parsing rule set
    0ms      91Mb     60Mb making sense of files and directories
    0ms      91Mb     60Mb determining how to resolve
    0ms      91Mb     60Mb reading files
   53ms      91Mb     69Mb reading files: gathering initial sources
 8174ms     303Mb    276Mb reading files: visiting dependencies
    1ms     303Mb    276Mb analyzing
 1374ms     306Mb    279Mb analyzing: cycles
    0ms     306Mb    279Mb analyzing: dependents
   24ms     306Mb    277Mb analyzing: orphans
    0ms     306Mb    277Mb analyzing: reachables
    0ms     306Mb    277Mb analyzing: module metrics
    0ms     306Mb    277Mb analyzing: add focus (if any)
   65ms     307Mb    270Mb analyzing: validations
  116ms     308Mb    277Mb analyzing: comparing against known errors
    6ms     308Mb    278Mb reporting
    1ms     308Mb    278Mb really done (10804ms)
```
The relatively small difference between swc and tsc (where swc on its own is so much faster than tsc it’s not even funny) supports my hypothesis that the bottleneck currently isn’t in the parsing itself.
I do think parallelisation can make the step faster - but I'm a bit hesitant because running things in parallel can make code complicated to follow and prone to bugs when not done right.
If I had to put my money on something to make dependency-cruiser faster, it’d be using the outcome of the previous run and only re-investigating the diff. It’s not on the roadmap yet, but I'm investigating the possibilities.
Last time I checked (a few weeks back), on my day job's codebase dependency-cruiser was a bit faster than eslint. So while speeding things up is always on my mind, I've put focus on other things in the meantime.
Tweaking performance with options
There are ways to make dependency-cruiser process faster as-is, b.t.w. - I don’t think this is covered in the documentation yet, so I’ll put this write-up somewhere in there.
At its default settings it favours correctness over speed - erring on the side of caution. Things that influence performance, loosely in order of impact:
- moduleSystems - reducing the number of module systems to the ones you actually use (these days likely only `es6` and possibly `cjs`) will speed up processing.
- tsPreCompilationDeps - with this on `true`, dependency-cruiser runs faster, as it directly analyzes the typescript AST instead of transpiling it down to javascript first.
- enhancedResolveOptions.extensions - setting this to only the extensions you actually use (e.g. `.ts`) will make resolving faster. By default dependency-cruiser initialises this array to support all extensions the parsers in your current environment support (for just tsc or swc this will be something like .js, .cjs, .mjs, .ts, .d.ts, .jsx, .tsx). Ordering that array from most to least occurring will help as well, as the resolver will try to find files in that order and stop once it’s found one.
- older versions of dependency-cruiser initialised `doNotFollow` to a bunch of `dependencyTypes` - it’s faster (and just as accurate - even in yarn PnP) to set a `path` (typically `node_modules`).
- And as noted above - if your codebase can be compiled successfully with `swc`, setting the parser to `swc` will help speed things up.
- By default some of the rules dependency-cruiser ships with include regular expressions that will not be entirely applicable to your code base (I guess there aren’t a lot of files ending in `.coffee` anymore these days …). It’s probably not going to make a big dent, but every little bit helps.
will respond later to templating for comment fields and more granular rules
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Took me a bit to get back to these questions, @sverweij. Thanks a lot for that very, very detailed explanation 🙏 I noticed that you added some caching (🔥😄 Awesome!!). So this might be a bit unnecessary if we can rely on it, but I still wanted to explore this for non-cache runs.
So, I started looking into everything you mentioned and tried profiling our main frontend codebase (last run: 11286 modules, 37666 dependencies).
For a bit of context, `--progress performance-log` outputs the following:
```
 elapsed heapTotal heapUsed after step...
  1271ms      89Mb     64Mb start of node process
     7ms      89Mb     64Mb parsing options
   419ms     108Mb     82Mb parsing rule set
    10ms     108Mb     83Mb determining how to resolve
     0ms     108Mb     83Mb reading files
   396ms     109Mb     91Mb reading files: gathering initial sources
303086ms    2223Mb   2001Mb reading files: visiting dependencies
     3ms    2223Mb   2001Mb analyzing
  2090ms    2225Mb   2008Mb analyzing: cycles
 17154ms    2226Mb   2016Mb analyzing: dependents
     7ms    2226Mb   2017Mb analyzing: orphans
     3ms    2226Mb   2017Mb analyzing: reachables
     2ms    2226Mb   2017Mb analyzing: module metrics
     2ms    2226Mb   2017Mb analyzing: add focus (if any)
 15588ms    2226Mb   2027Mb analyzing: validations
   158ms    2226Mb   2072Mb reporting
   103ms    2276Mb   2122Mb really done (340300ms)
```
The biggest part being, as you mentioned, reading files: visiting dependencies. So I tried diving a bit deeper into profiling/measuring and summing up the execution times of internal functions, starting from getDependencies (https://github.com/sverweij/dependency-cruiser/blob/develop/src/extract/get-dependencies.js#L218), and wanted to share findings and questions with you.
The TL;DR is that `resolver.resolveSync` here is where most of the work is being done (at least in our case); pretty much all of it is the call to `resolveSync` from "enhanced-resolve".
The actual runtimes (all in milliseconds):

```
uniqBy (extractDependencies, getDependencyUniqueKey): 7058.00629478693
|-> extractWithTsc: 6941.484633654356
    |-> readFileSync: 1273.1560349464417
    |-> parseAST: 4571.4655103087425
addResolutionAttributes (.map) -> 294283.9156706929
|-> resolve: 293894.22545331717
    |-> resolveWithRetry: 288099.59055906534
        |-> stripToModuleName -> 122.57531198859215
        |-> conditional check -> 6.731654405593872
        |-> resolveModule -> 281106.77762940526
            |-> resolveCommonJS -> 283997.5528010428
                |-> addResolutionAttributes - conditional -> 29.38234919309616
                |-> addResolutionAttributes - pathToPosix -> 282514.4725140035
                    |-> resolve -> 281423.1681384882 🚨🚨🚨🚨 // Pretty much the "resolver.resolveSync" call
                |-> addResolutionAttributes - isFollowable -> 77.79942151904106
|-> stripToModuleName: 37.053578436374664
|-> create lResolvedDependency: 1174.7631360292435
    |-> pathToPosix: 4216.38135227561
|-> matchesDoNotFollow: 142.73734945058823
|-> create response object: 156.89516851305962
```
I also looked into tracing with node's --prof flag and at some flamegraphs via `0x`, and it does look like a lot of the work goes to either `enhanced-resolve` or `tsconfig-paths-webpack-plugin`.
I filtered out "application" calls and then clicked on the top-most element of the flamegraph JIC. This is one, but there's a buuuunch like this.
So, I started thinking about a few things and had more questions for you 😅
- What does the `enhanced-resolve` package solve for this project?
- And to that point... could there be a faster way of resolving modules?
  - Like, is there a way we could leverage tsc/swc or node's default module resolution instead, for example?
  - What I'm thinking is... if my application doesn't use webpack at all, for example, could I get some benefit from not relying on `enhanced-resolve` or `tsconfig-paths-webpack-plugin`?
- I also noticed that `useSyncFileSystemCalls` is defined/forced as true here. Is that so we can use `enhanced-resolve`'s `resolveSync`?
  - Wondering if the sync calls are blocking us from leveraging the event loop?
  - Could we process extractRecursive with a queue and something like https://www.npmjs.com/package/p-map with some concurrency? 🤔
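To illustrate the p-map idea, here is a minimal concurrency-limited async map. It is a hand-rolled stand-in for p-map, and whether extractRecursive's per-file work could actually be queued this way is an open question, since the current extraction code is synchronous:

```javascript
// Minimal concurrency-limited async map, in the spirit of p-map, to sketch
// how per-file work could be queued. Whether this helps dependency-cruiser
// is an open question - its extraction code is currently synchronous.
async function mapWithConcurrency(items, mapper, concurrency) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function worker() {
    // each worker pulls the next unclaimed index until the queue is drained
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await mapper(items[index], index);
    }
  }

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Usage would be something like `mapWithConcurrency(files, extractOneFile, 8)`, where `extractOneFile` is a hypothetical async wrapper around the per-file extraction.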
Sorry for all the questions, this is a very interesting project, thanks a lot for your time 🙏
Hi @fforres thanks for the detailed analysis.
Noticed that you added some caching (🔥😄Awesome!! )
Thanks 😊. The next step will be to only run on the diff between this and the previous run (it now only serves from cache when nothing has changed).
So this might be a bit unnecessary if we can rely on it, but still wanted to explore this for non-cache runs.
It's still necessary and useful to keep looking for optimisations; the ecosystem moves on and might have new opportunities. And there might be inefficiencies in the code that weren't detected yet.
What does the enhanced-resolve package solve for this project?
We use enhanced-resolve to resolve module names to files on disk. Different from other programming languages there's a myriad of ways this can be done (automatic expansions, implicit and explicit rules, aliases etc).
Given a module name, a resolver will try a list of options. E.g. when you import something from './some-module' it'll check whether `some-module` exists with all extensions it knows (.js, .json, .node, potentially .jsx - and in typescript environments also .ts, .tsx, .d.ts ...). For non-relative paths it'll look for something in node_modules matching the description (traversing up to your disk's root to find node_modules folders), look up the package.json, find the main module, etc. Each of these lookups will access the disk when it isn't cached yet. This is also why reducing the number of extensions, module systems etc. can reduce the time spent dramatically.
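As an illustration of why the extensions list matters, here is a toy sketch of the candidate list a resolver walks for a relative import. The lookup order and extension set are simplified assumptions, not enhanced-resolve's actual algorithm:

```javascript
// Toy sketch of the candidate list a resolver walks for './some-module'.
// The order and extension set are simplified assumptions - enhanced-resolve's
// real algorithm also handles package.json "main"/"exports", aliases, etc.
function candidatePaths(baseDir, specifier, extensions) {
  const candidates = [];
  for (const ext of extensions) {
    candidates.push(`${baseDir}/${specifier}${ext}`); // ./some-module.ts, ...
  }
  for (const ext of extensions) {
    candidates.push(`${baseDir}/${specifier}/index${ext}`); // directory import
  }
  return candidates;
}

const tried = candidatePaths("src", "some-module", [".js", ".ts", ".d.ts"]);
// every entry is a potential disk access (stat) until one exists, which is
// why trimming and reordering the extensions array cuts I/O directly
```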
We've chosen enhanced-resolve over other options (e.g. browserify's resolve, or node's own require resolver) because it gives us the flexibility to influence the module resolution a.o. with its plugin system. It also has decent caching - and it's a separate module we can actually integrate (as far as I know esbuild and swc don't have that). The fact webpack uses it for module resolution as well was only a secondary consideration.
And to that point... Could there be a faster way of resolving modules?
If that exists (and it's just as feature-rich as enhanced-resolve) it'd be interesting to have it in dependency-cruiser. Last time I checked, swc and esbuild didn't have a separately callable resolver. My hypothesis is that enhanced-resolve (and/or tsconfig-paths(-plugin) in the case of typescript) spends most of its time accessing files.
Noticed also that useSyncFileSystemCalls is defined/forced as true here. Is it so we can use enhanced-resolve's resolveSync?
Yes.
Wondering if the sync calls are blocking us from leveraging the loop?
It's definitely blocking. That would be a serious problem if nodejs had something else useful to do (e.g. when it's serving web requests). Here, the other thing nodejs would be able to do at the same time is access other file(s), and as far as I know that'll bump into the physical restrictions of the disk in any case. In the past I've done some simple tests of sync vs async code accessing disk, and to my surprise the async code was measurably slower. That was some time ago (> 1y), though, and I might've done something wrong in that test, so it might be worth a retry. Async code is more complicated to handle, though, and is kind of viral, requiring all calling code to be async as well (in the case of resolve.js about 18 modules).
Heyooo @sverweij necromancing this old issue to bring something to your attention.
So, recently I was navigating the twitters, and I came across this project:
- OXC (link), a javascript compiler written in rust.
More than the compiler itself, I noticed it is composed of a couple of packaged tools, and one of them is a resolver with the interesting claim of being 28x faster than enhanced-resolve.
It is written in Rust, and seems to be a port of enhanced-resolve (with the test suite imported from it and passing).
Docs: https://oxc-project.github.io/docs/guide/usage/resolver.html
Crate: https://docs.rs/oxc_resolver/latest/oxc_resolver
I know there's overhead with node/rust interop (and that "28x faster" probably won't be a huge win in smaller projects), but I thought it might be interesting to bring to your attention 😄 (It might be worth it for big projects.)
Hi @fforres thanks! That looks like a cool project - and 28x faster module resolution sounds very tempting. From a quick glance it looked like the (still experimental) nodejs binding has an interface that is quite close to enhanced-resolve, so it shouldn't be too hard to try out.