Performance issue in `git log` commands to read last update metadata
Description
We need to read `git log` entries for:
- Content plugins: the `loadContent()` lifecycle, to compute docs/blog/pages last update metadata when `showLastUpdateTime` or `showLastUpdateAuthor` is enabled
- Sitemap: the `postBuild()` lifecycle, to compute the value of a `<lastmod>` tag from `route?.metadata?.sourceFilePath` (if provided by plugins, which may not be the case for third-party plugins)
As of Docusaurus 3.8, we queue one command like this for each relevant file:

```bash
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "i18n/fr/docusaurus-plugin-content-blog/releases/2.3/index.mdx"
```
More precisely, we first `cd` into the file's directory and run the command against the bare filename:

```bash
cd i18n/fr/docusaurus-plugin-content-blog/releases/2.3/
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "index.mdx"
```
We do this because it's possible that users have git submodules, and the Git repo at the root may not contain the history of that specific file.
Performance problems
There are various performance bottlenecks in our implementation.
Main case
The `git log` command queueing works, but it remains a bottleneck, and there's probably room to improve things.
On a large 11k-docs site, I noticed that `loadContent()` takes 24s when the docs `lastUpdate` options are enabled, but only 4s when they are disabled; these options are the main bottleneck of the `loadContent()` lifecycle for large sites.
I'm not sure how to improve the performance of this, but we likely have a few options to explore.
Useful resources:
- https://github.com/Brooooooklyn/simple-git
- https://www.nodegit.org/
- https://github.com/isomorphic-git/isomorphic-git
Suggested in:
- https://x.com/sebastienlorber/status/1927310075284111744
- https://bsky.app/profile/sebastienlorber.com/post/3lq5hx3kuec2d
We could also try to optimize the git commands themselves. I think `rev-parse` and batching history-read commands might work, but we need to study that.
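For illustration, here's a rough sketch of the batching idea: one `git log --name-only` pass over the whole history, parsed into a per-file map, instead of one process per file. `getLastUpdatesBatch` is a hypothetical helper, not the actual Docusaurus implementation:

```ts
import {execFile} from 'child_process';
import {promisify} from 'util';

const execFileAsync = promisify(execFile);

type LastUpdate = {timestamp: number; author: string};

// Read the whole history with a single git process, instead of spawning
// one `git log` process per file.
// Note: --name-only prints paths relative to the repo root, not to cwd.
async function getLastUpdatesBatch(
  repoDir: string,
): Promise<Map<string, LastUpdate>> {
  const {stdout} = await execFileAsync(
    'git',
    ['-c', 'log.showSignature=false', 'log', '--format=COMMIT:%ct,%an', '--name-only'],
    {cwd: repoDir, maxBuffer: 100 * 1024 * 1024},
  );
  const result = new Map<string, LastUpdate>();
  let current: LastUpdate | undefined;
  for (const line of stdout.split('\n')) {
    const match = line.match(/^COMMIT:(\d+),(.*)$/);
    if (match) {
      current = {timestamp: Number(match[1]), author: match[2]};
    } else if (line && current && !result.has(line)) {
      // git log is newest-first: the first commit seen for a file wins
      result.set(line, current);
    }
  }
  return result;
}
```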
We could also try to create an incremental "last update" cache: read the full history first, then only read the newly added commits on rebuilds. This could also help display the real last update data in dev. But maybe it would be slow to initialize for very large Git repos 🤷‍♂️
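A minimal sketch of the incremental idea, assuming a simple JSON cache file keyed by the last HEAD we processed (`updateLastUpdateCache` and the cache shape are hypothetical):

```ts
import fs from 'fs/promises';
import {execFile} from 'child_process';
import {promisify} from 'util';

const execFileAsync = promisify(execFile);

type Cache = {
  head: string;
  entries: Record<string, {timestamp: number; author: string}>;
};

// On rebuild, only read the commits added since the cached HEAD,
// instead of re-reading the full history.
async function updateLastUpdateCache(cachePath: string): Promise<Cache> {
  const head = (await execFileAsync('git', ['rev-parse', 'HEAD'])).stdout.trim();
  let cache: Cache = {head: '', entries: {}};
  try {
    cache = JSON.parse(await fs.readFile(cachePath, 'utf8'));
  } catch {
    // cold start: no cache yet, read the full history below
  }
  if (cache.head === head) {
    return cache; // no new commits since the last build
  }
  // Only the new commits (a rebased/pruned cached head would need a
  // fallback to a full read, omitted here)
  const range = cache.head ? `${cache.head}..HEAD` : 'HEAD';
  const {stdout} = await execFileAsync('git', [
    'log', range, '--format=COMMIT:%ct,%an', '--name-only',
  ]);
  let current: {timestamp: number; author: string} | undefined;
  for (const line of stdout.split('\n')) {
    const match = line.match(/^COMMIT:(\d+),(.*)$/);
    if (match) {
      current = {timestamp: Number(match[1]), author: match[2]};
    } else if (line && current) {
      const previous = cache.entries[line];
      // keep the most recent commit per file
      if (!previous || current.timestamp > previous.timestamp) {
        cache.entries[line] = current;
      }
    }
  }
  cache.head = head;
  await fs.writeFile(cachePath, JSON.stringify(cache));
  return cache;
}
```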
i18n / codegen / untracked case
I've also noticed that the commands we issue are significantly slower for untracked Git files. This is usually the case:
- when downloading i18n MDX files from Crowdin
- when using codegen plugins that create untracked MDX files before running `docusaurus build`
This can be seen locally on our own website on a localized build after downloading French translation files:
```
[PERF] Build > en > Load site - 4.53 seconds!
[PERF] Build > en > postBuild() - 3.34 seconds!
[PERF] Build > fr > Load site - 6.15 seconds!
[PERF] Build > fr > postBuild() - 11.53 seconds!
```
It is unexpected that the French site build has slower `loadContent()` and `postBuild()` phases than the English one.
The impact is even more significant on our Netlify CI, which is not as powerful as my Mac M3 Pro:
```
[PERF] Build > en > Load site - 5.71 seconds!
[PERF] Build > en > postBuild() - 12.78 seconds!
[PERF] Build > fr > Load site - 104.85 seconds!
[PERF] Build > fr > postBuild() - 203.77 seconds!
[PERF] Build > pt-BR > Load site - 191.19 seconds!
[PERF] Build > pt-BR > postBuild() - 298.34 seconds!
[PERF] Build > ko > Load site - 265.14 seconds!
[PERF] Build > ko > postBuild() - 372.63 seconds!
[PERF] Build > zh-CN > Load site - 352.54 seconds!
[PERF] Build > zh-CN > postBuild() - 449.76 seconds!
```
Note: there's likely a logging issue (or something worse 🤔); the numbers look like they accumulate and are not expected to grow like that. But we can already see that `loadSite()` and `postBuild()` are much slower on localized sites.
In practice, the `postBuild()` hook seems to take more time for each subsequent locale, so I suspect something really weird is happening here, but I can't reproduce it locally yet.
Duplicated work
If a file is not tracked by Git and the history couldn't be read in `loadContent()`, then we try to read it again in `postBuild()`, which, as described above, is quite slow.
This should be easy to fix, so I'll handle it right away in https://github.com/facebook/docusaurus/pull/11211
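A minimal sketch of the fix idea: memoize the lookup and cache even `null` results, so `postBuild()` doesn't retry files that `loadContent()` already failed to resolve (hypothetical names, not the actual PR implementation):

```ts
type LastUpdate = {timestamp: number; author: string};

// Cache both successes and failures (`null`), keyed by file path
const lastUpdateCache = new Map<string, LastUpdate | null>();

async function getLastUpdateMemoized(
  filePath: string,
  read: (filePath: string) => Promise<LastUpdate | null>,
): Promise<LastUpdate | null> {
  if (lastUpdateCache.has(filePath)) {
    // A cached `null` means "we already tried and failed in loadContent()":
    // don't run the slow git command a second time in postBuild()
    return lastUpdateCache.get(filePath)!;
  }
  const result = await read(filePath);
  lastUpdateCache.set(filePath, result);
  return result;
}
```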
Design issue
There are cases where we know for sure there's no last update metadata available. It should be possible to configure the site or plugins to avoid reading the Git history entirely. It's probably possible today through `parseFrontMatter` and `process.env.DOCUSAURUS_CURRENT_LOCALE`, but it is clearly not intuitive and we should provide better APIs.
Examples include:
- i18n sites: if translations are all downloaded from Crowdin, we likely don't want to read the last update metadata from Git for any file in the untracked `i18n` folder.
- plugins using code generation: we have various third-party plugins that generate MDX files that our content plugins will then read. An example is our own Changelog plugin. It should be possible to configure our content plugins so that they directly emit `null` values instead of running useless `git log` commands.
In general, we probably want to introduce a global `siteConfig.readLastUpdateData` callback hook so that it's possible to implement other strategies that are not related to Git.
Here's an example usage based on a pseudo-implementation:
```js
const siteConfig = {
  readLastUpdateData: async (filePath, utils) => {
    if (filePath.startsWith('i18n')) {
      return null;
    }
    if (filePath.startsWith('changelog')) {
      return utils.readFromFS(filePath);
    }
    return utils.readFromGit(filePath);
  },
};
```
The `git log` issue seriously affects our CI build times.
These times were measured on Netlify, on cold builds (no Rspack persistent cache).
Before any optimization:
- Total CI time: 45min
- Docusaurus build time: 34min
- Log: https://app.netlify.com/projects/docusaurus-2/deploys/6836e5968987a24ac2b722e8
After the duplicate call optimization:
- Total CI time: 30min
- Docusaurus build time: 18min
- Log: https://app.netlify.com/projects/docusaurus-2/deploys/6836fb80f5f589000825dabc
After the local i18n opt-out optimization:
- Total CI time: 15min
- Docusaurus build time: 6min
- Log: https://app.netlify.com/projects/docusaurus-2/deploys/68371f9a8169c60ebd35f258
Our CI spends approximately 10 minutes setting things up (initialization, building the monorepo, building and downloading Crowdin site translations...).
So, optimizing these `git log` calls and removing useless ones can have a major impact on Docusaurus build times, depending on the site.
In our case, `docusaurus build` went from 34min to 6min 🔥. We now need to figure out how to ship these improvements to the community, preferably in a way that doesn't require complex configuration tweaks.
I spent way too much time trying to figure out why `git log` calls are slow in our Netlify CI and keep getting slower as we build more locales. I'm not even sure this affects anyone apart from us right now, so I'm going to stop and move on for now, but please let me know if you are affected as well.
Here are some notes from various experiments run in this PR: https://github.com/facebook/docusaurus/pull/11218
To get faster feedback, I versioned the i18n translations from Crowdin to avoid re-downloading them (which is slow).
Git log benchmarks
Here's a local `git log` command benchmark. The exact options we use are not very relevant, but we can clearly see that the command is slower for an untracked file like `ignored/architecture.mdx`.
Note: I didn't run these commands on Netlify; it could be worth a try if possible.
```bash
hyperfine \
  'git log --max-count=1 -- "website/docs/advanced/architecture.mdx"' \
  'git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
  'git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
  'git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
  'git log --max-count=1 -- "website/i18n/test/architecture.mdx"' \
  'git log --max-count=1 -- "ignored/architecture.mdx"'
```
```
Benchmark 1: git log --max-count=1 -- "website/docs/advanced/architecture.mdx"
  Time (mean ± σ):     12.3 ms ± 3.4 ms    [User: 7.6 ms, System: 3.8 ms]
  Range (min … max):   10.5 ms … 39.7 ms    70 runs

Benchmark 2: git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):     6.4 ms ± 0.5 ms    [User: 3.5 ms, System: 2.5 ms]
  Range (min … max):   5.4 ms … 8.1 ms    273 runs

Benchmark 3: git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):     6.3 ms ± 0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):   5.4 ms … 8.9 ms    293 runs

Benchmark 4: git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):     6.3 ms ± 0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):   5.3 ms … 7.5 ms    298 runs

Benchmark 5: git log --max-count=1 -- "website/i18n/test/architecture.mdx"
  Time (mean ± σ):     6.4 ms ± 0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):   5.2 ms … 7.8 ms    316 runs

Benchmark 6: git log --max-count=1 -- "ignored/architecture.mdx"
  Time (mean ± σ):     25.6 ms ± 2.5 ms    [User: 20.8 ms, System: 3.9 ms]
  Range (min … max):   23.5 ms … 44.4 ms    63 runs

Summary
  git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx" ran
    1.01 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
    1.02 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/test/architecture.mdx"
    1.02 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
    1.97 ± 0.57 times faster than git log --max-count=1 -- "website/docs/advanced/architecture.mdx"
    4.09 ± 0.49 times faster than git log --max-count=1 -- "ignored/architecture.mdx"
```
The problem doesn't seem related to the extra args we use.
Tracked vs non-tracked
Although `git log` commands are slower for untracked files, the difference is on the order of less than 100ms per command, and doesn't explain why our CI would take up to 20 seconds to run a single command.
Adding the i18n files to Git didn't change anything, so the problem doesn't seem related to tracked vs untracked files.
Execa vs child_process, shell vs non-shell
I tried removing Execa in favor of a simpler child_process integration:
```ts
import {exec, type ExecOptions} from 'child_process';

// Promisified wrapper around child_process.exec (which always spawns a shell)
function execPromise(
  command: string,
  options: ExecOptions,
): Promise<{exitCode: number; stdout: string; stderr: string}> {
  // options.shell = '/bin/bash';
  return new Promise((resolve, reject) => {
    exec(command, options, (error, stdout, stderr) => {
      if (error) {
        reject(error);
        return;
      }
      resolve({exitCode: 0, stdout, stderr});
    });
  });
}
```
This didn't lead to any significant improvement.
Note: we use execa with `shell: true`. It's possible (and even recommended by Execa) to not use a shell (because a shell is apparently slower and less secure), but from my tests, using the shell option is clearly faster locally (with both execa and child_process).
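For reference, with plain child_process the shell/no-shell distinction is `exec` (always goes through a shell) vs `execFile` (no shell). A hypothetical micro-benchmark of the two:

```ts
import {exec, execFile} from 'child_process';
import {promisify} from 'util';

const execAsync = promisify(exec); // spawns a shell (like execa's shell: true)
const execFileAsync = promisify(execFile); // no shell, args passed directly

async function compareShellVsNoShell(file: string): Promise<void> {
  console.time('shell');
  await execAsync(`git log --max-count=1 -- "${file}"`);
  console.timeEnd('shell');

  console.time('no shell');
  await execFileAsync('git', ['log', '--max-count=1', '--', file]);
  console.timeEnd('no shell');
}

compareShellVsNoShell('website/docs/advanced/architecture.mdx');
```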
Using git ls-files
I tried running a git command upfront to check if the file is tracked, but it didn't improve anything:

```ts
const command = `(git ls-files --error-unmatch "${file}" || exit 42) && git log ${args} -- "${file}"`;
```
😢 This also affects GitHub Actions: https://github.com/facebook/docusaurus/actions/runs/15351301297/job/43199950972?pr=11218
I tried running the exact same CI commands locally and didn't encounter this problem.
Apparently someone else encountered that problem: https://github.com/facebook/docusaurus/discussions/11398
Have you looked into using a Git library like https://github.com/nodegit/nodegit or https://github.com/isomorphic-git/isomorphic-git ? Might conceivably be faster than spawning a bunch of processes.
I agree, no need to spawn processes here.
Yes, those libraries are listed in the issue description already as solutions to consider.
From what I remember:
- Nodegit doesn't seem to support Node >= 18, so even if it works, we'd need to fix that as well
- I couldn't get simple-git to work
- I didn't try isomorphic-git, but I wonder whether it would be as fast as native Git
I definitely agree that we shouldn't spawn one process per file. Retrieving the full Git history with timestamps/authors is relatively fast, even for a repo like ours with 10,000+ commits.
But we also need to take into account:
- the modular nature of Docusaurus, which may need to call the `getLastUpdate()` function in various plugins and plugin steps (we can't read everything upfront globally)
- the fact that our users have esoteric setups involving git submodules or many independent git repositories
- the fact that some files may be untracked
@slorber not sure if you've found other solutions, but I found this to be much faster on my machine:
```bash
#!/bin/bash
# Get the last updated time of files in a directory based on Git history
git log --name-only --follow --format="COMMIT %h %ci" -- "$1" | awk '
  # Commit header: capture the hash and the committer date (%ci = "YYYY-MM-DD HH:MM:SS +TZ")
  /^COMMIT/ {commit=$2; date=$3 " " $4 " " $5}
  # File lines: git log is newest-first, so keep only the first date seen per file
  !/^COMMIT/ && !/^$/ && !($0 in files) {files[$0]=date}
  END {for (file in files) print file ": " files[file]}
'
```
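Assuming the script above is saved as `last-updated.sh` (hypothetical name), running `./last-updated.sh website/docs` prints one `file: date` line for every file under `website/docs`.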
You may batch reads to optimize performance.
FYI, here's a follow-up issue where we'll see if the e18e community could be interested in helping us solve this perf bottleneck efficiently once and for all: https://github.com/e18e/ecosystem-issues/issues/216