Performance issue in `git log` commands to read last update metadata

Open slorber opened this issue 5 months ago • 3 comments

Description

We need to read git log entries for:

  • Content plugins loadContent() lifecycle, to compute docs/blog/pages last update metadata when showLastUpdateTime or showLastUpdateAuthor is enabled
  • Sitemap postBuild() lifecycle to compute the value of a <lastmod> tag from the route?.metadata?.sourceFilePath (if provided by plugins, which may not be the case for third-party plugins)

As of Docusaurus 3.8, we queue commands like this one for each relevant file:

git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "i18n/fr/docusaurus-plugin-content-blog/releases/2.3/index.mdx"

More precisely, we first cd into the file's directory and run git log on the file name alone:

cd i18n/fr/docusaurus-plugin-content-blog/releases/2.3/
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "index.mdx"

We do this because it's possible that users have git submodules, and the Git repo at the root may not contain the history of that specific file.
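For context, the --format=RESULT:%ct,%an placeholders expand to the committer date as a Unix timestamp (%ct) and the author name (%an). Here's a minimal sketch of parsing one such output line (a hypothetical helper, not the actual Docusaurus code):

```typescript
// Hypothetical parser for one line of `git log --format=RESULT:%ct,%an` output.
// %ct is the committer date as a Unix timestamp (seconds); %an is the author name.
function parseGitLogResult(
  line: string,
): {timestamp: number; author: string} | null {
  const match = line.match(/^RESULT:(\d+),(.*)$/);
  if (!match) {
    return null;
  }
  // Convert seconds to milliseconds for JS Date compatibility.
  return {timestamp: Number(match[1]) * 1000, author: match[2]};
}
```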


Performance problems

There are various performance bottlenecks in our implementation.

Main case

The git log command queueing works, but it remains a bottleneck, and there's probably room to improve things.

On a large 11k docs site, I noticed that loadContent() takes 24s when docs lastUpdate options are enabled, but only 4s when they are disabled, and these options are the main bottleneck of the loadContent() lifecycle for large sites.

I'm not sure how to improve the performance of this, but we likely have a few options to explore.

Useful resources:

  • https://github.com/Brooooooklyn/simple-git
  • https://www.nodegit.org/
  • https://github.com/isomorphic-git/isomorphic-git

Suggested in:

  • https://x.com/sebastienlorber/status/1927310075284111744
  • https://bsky.app/profile/sebastienlorber.com/post/3lq5hx3kuec2d

We could also try to optimize the git commands themselves. I think rev-parse and batching history-read commands might work, but we need to study that.
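As a sketch of the batching idea: a single git log pass with --name-only could cover a whole directory, and since git log emits commits newest-first, the first commit block that mentions a file gives that file's last update. This is hypothetical code, not the current implementation:

```typescript
type LastUpdate = {timestamp: number; author: string};

// Hypothetical parser for the output of a single batched command like:
//   git log --format=COMMIT:%ct,%an --name-only -- <dir>
// git log lists commits newest-first, so the first commit block that
// mentions a file wins: it is that file's last update.
function parseBatchedLog(output: string): Map<string, LastUpdate> {
  const result = new Map<string, LastUpdate>();
  let current: LastUpdate | null = null;
  for (const line of output.split('\n')) {
    const match = line.match(/^COMMIT:(\d+),(.*)$/);
    if (match) {
      current = {timestamp: Number(match[1]) * 1000, author: match[2]};
    } else if (line.trim() !== '' && current && !result.has(line)) {
      result.set(line, current);
    }
  }
  return result;
}
```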

We could also try to create an incremental "last update" cache, reading the full history first, and then only reading from the newly added commits on rebuilds. This could also help to display the real last update data in dev. But maybe it would be slow to initialize for very large Git repos 🤷‍♂
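The incremental cache could be as simple as persisting the per-file map together with the HEAD commit it was computed at; on rebuild we'd only read git log <cachedHead>..HEAD and overlay the newer entries. A hypothetical sketch of the merge step:

```typescript
type LastUpdate = {timestamp: number; author: string};
type LastUpdateCache = {head: string; byFile: Map<string, LastUpdate>};

// Hypothetical merge: entries read from the new commit range
// (cachedHead..HEAD) take precedence over previously cached ones,
// because they are more recent by construction.
function mergeIncremental(
  cache: LastUpdateCache,
  newHead: string,
  newer: Map<string, LastUpdate>,
): LastUpdateCache {
  return {head: newHead, byFile: new Map([...cache.byFile, ...newer])};
}
```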

i18n / codegen / untracked case

I've also noticed that the commands we issue are significantly slower for untracked Git files. This typically happens:

  • when downloading i18n mdx files from Crowdin
  • when using codegen plugins creating untracked MDX files before running docusaurus build

This can be seen locally on our own website on a localized build after downloading French translation files:

[PERF] Build > en > Load site - 4.53 seconds!
[PERF] Build > en > postBuild() - 3.34 seconds!

[PERF] Build > fr > Load site - 6.15 seconds!
[PERF] Build > fr > postBuild() - 11.53 seconds!

It is unexpected that a French site build has a slower loadContent() and postBuild() compared to the English site.

The impact is even more significant on our Netlify CI, which is not as powerful as my Mac M3 Pro:

[PERF] Build > en > Load site - 5.71 seconds!
[PERF] Build > en > postBuild() - 12.78 seconds!

[PERF] Build > fr > Load site - 104.85 seconds!
[PERF] Build > fr > postBuild() - 203.77 seconds!

[PERF] Build > pt-BR > Load site - 191.19 seconds!
[PERF] Build > pt-BR > postBuild() - 298.34 seconds!

[PERF] Build > ko > Load site - 265.14 seconds!
[PERF] Build > ko > postBuild() - 372.63 seconds!

[PERF] Build > zh-CN > Load site - 352.54 seconds!
[PERF] Build > zh-CN > postBuild() - 449.76 seconds!

Note: there's likely a logging issue (or something worse 🤔); the numbers look like they accumulate, and they're not expected to grow like that. But we can already see that loadSite() and postBuild() are much slower on localized sites.

In practice, the postBuild() hook seems to take more time for each subsequent locale, so I suspect something really weird is happening here, but I can't reproduce it locally yet.

Duplicated work

If a file is not tracked by Git and the history couldn't be read in loadContent(), then we try to read it again in postBuild(), which, as described above, is quite slow.

This should be easy to fix so I'll handle this right now in https://github.com/facebook/docusaurus/pull/11211


Design issue

There are cases where we know for sure there's no last update metadata available. It should be possible to configure the site or plugins to avoid reading the Git history entirely. It's probably possible today through parseFrontMatter and process.env.DOCUSAURUS_CURRENT_LOCALE, but it is clearly not intuitive and we should provide better APIs.

Examples include:

  • i18n sites: if translations are all downloaded from Crowdin, we likely don't want to read the last update metadata from Git for any file in the untracked i18n folder.
  • plugins using code-generation: we have various third-party plugins that generate MDX files that our content plugins will then read. An example is our own Changelog plugin. It should be possible to configure our content plugins so that they directly emit null values instead of running useless git log commands.

In general, we probably want to introduce a global siteConfig.readLastUpdateData callback hook so that it's possible to implement other strategies that are not related to Git.

Here's an example usage based on a pseudo-implementation:

const siteConfig = {
  readLastUpdateData: async (filePath, utils) => {
    if (filePath.startsWith("i18n")) {
      return null;
    }
    if (filePath.startsWith("changelog")) {
      return utils.readFromFS(filePath);
    }
    return utils.readFromGit(filePath);
  }
}
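To make the contract concrete, here's a hedged sketch typing and exercising such a callback; the utils object and its method names are hypothetical, mirroring the pseudo-implementation above:

```typescript
type LastUpdate = {timestamp: number; author: string} | null;
type LastUpdateUtils = {
  readFromGit: (filePath: string) => Promise<LastUpdate>;
  readFromFS: (filePath: string) => Promise<LastUpdate>;
};

// Hypothetical strategy: skip Git entirely for untracked i18n files,
// use the filesystem for generated changelog files, fall back to Git.
async function readLastUpdateData(
  filePath: string,
  utils: LastUpdateUtils,
): Promise<LastUpdate> {
  if (filePath.startsWith('i18n/')) {
    return null;
  }
  if (filePath.startsWith('changelog/')) {
    return utils.readFromFS(filePath);
  }
  return utils.readFromGit(filePath);
}
```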

slorber avatar May 28 '25 09:05 slorber

The git log issue seriously affects our CI build times.

Those are times measured on Netlify on cold builds (no Rspack persistent cache)

Before any optimization:

  • Total CI time: 45min
  • Docusaurus build time: 34min
  • Log: https://app.netlify.com/projects/docusaurus-2/deploys/6836e5968987a24ac2b722e8

After the duplicate call optimization:

  • Total CI time: 30min
  • Docusaurus build time: 18min
  • Log: https://app.netlify.com/projects/docusaurus-2/deploys/6836fb80f5f589000825dabc

After the local i18n opt-out optimization:

  • Total CI time: 15min
  • Docusaurus build time: 6min
  • Log: https://app.netlify.com/projects/docusaurus-2/deploys/68371f9a8169c60ebd35f258

Our CI spends approximately 10min to set things up (initialization, building the monorepo, building and downloading Crowdin site translations...).

So, optimizing these git log calls and removing useless ones can significantly reduce Docusaurus build times, depending on the site.

In our case, docusaurus build went from 34min to 6min 🔥. We now need to figure out how to ship these improvements to the community, preferably in a way that doesn't require complex configuration.

slorber avatar May 28 '25 15:05 slorber

I spent way too much time trying to figure out why git log calls are slow in our Netlify CI, and why they keep getting slower as we build more locales. I'm not even sure this affects anyone apart from us right now, so I'm going to stop and move on for now, but please let me know if you are affected as well.

Here are some notes from various experiments run in this PR: https://github.com/facebook/docusaurus/pull/11218

To get quicker feedback, I versioned the i18n translations from Crowdin to avoid downloading them (which is slow).


Git log benchmarks

Here's a local git log command benchmark. The extra options we use are not very relevant here, but we can clearly see that the command is slower for the untracked file ignored/architecture.mdx.

Note: I didn't run these commands on Netlify; it could be worth trying that there if possible.

hyperfine \
'git log --max-count=1 -- "website/docs/advanced/architecture.mdx"' \
'git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
'git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
'git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"' \
'git log --max-count=1 -- "website/i18n/test/architecture.mdx"' \
'git log --max-count=1 -- "ignored/architecture.mdx"'

Benchmark 1: git log --max-count=1 -- "website/docs/advanced/architecture.mdx"
  Time (mean ± σ):      12.3 ms ±   3.4 ms    [User: 7.6 ms, System: 3.8 ms]
  Range (min … max):    10.5 ms …  39.7 ms    70 runs

 
Benchmark 2: git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):       6.4 ms ±   0.5 ms    [User: 3.5 ms, System: 2.5 ms]
  Range (min … max):     5.4 ms …   8.1 ms    273 runs
 
Benchmark 3: git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):       6.3 ms ±   0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):     5.4 ms …   8.9 ms    293 runs
 
Benchmark 4: git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
  Time (mean ± σ):       6.3 ms ±   0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):     5.3 ms …   7.5 ms    298 runs
 
Benchmark 5: git log --max-count=1 -- "website/i18n/test/architecture.mdx"
  Time (mean ± σ):       6.4 ms ±   0.4 ms    [User: 3.5 ms, System: 2.4 ms]
  Range (min … max):     5.2 ms …   7.8 ms    316 runs
 
Benchmark 6: git log --max-count=1 -- "ignored/architecture.mdx"
  Time (mean ± σ):      25.6 ms ±   2.5 ms    [User: 20.8 ms, System: 3.9 ms]
  Range (min … max):    23.5 ms …  44.4 ms    63 runs
 
Summary
  git log --max-count=1 -- "website/i18n/deeply/nested/zh-XX/docusaurus-plugin-content-docs/current/advanced/architecture.mdx" ran
    1.01 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/zh-CN/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
    1.02 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/test/architecture.mdx"
    1.02 ± 0.10 times faster than git log --max-count=1 -- "website/i18n/fr/docusaurus-plugin-content-docs/current/advanced/architecture.mdx"
    1.97 ± 0.57 times faster than git log --max-count=1 -- "website/docs/advanced/architecture.mdx"
    4.09 ± 0.49 times faster than git log --max-count=1 -- "ignored/architecture.mdx"

The problem doesn't seem related to the extra args we use.


Tracked vs untracked

Although git log commands are slower for untracked files, the difference is under 100ms and doesn't explain why our CI would take up to 20 seconds to run a single command.

Adding i18n to Git didn't change anything, so the problem doesn't seem related to tracked vs untracked files.

Execa vs child_process, shell vs non-shell

I tried removing Execa in favor of a simpler child_process integration:

import {exec} from 'child_process';
import type {ExecOptions} from 'child_process';

function execPromise(
  command: string,
  options: ExecOptions,
): Promise<{exitCode: number; stdout: string; stderr: string}> {
  // options.shell = '/bin/bash';

  return new Promise((resolve, reject) => {
    exec(command, options, (error, stdout, stderr) => {
      if (error) {
        reject(error);
        return;
      }
      resolve({exitCode: 0, stdout, stderr});
    });
  });
}

This didn't lead to any significant improvement.

Note: we use execa with shell: true. It's possible (and even recommended by Execa) to not use a shell, because a shell is apparently slower and less secure, but from my tests I can clearly see that using the shell option is faster locally (with both execa and child_process).


Using git ls-files

I tried running a git command upfront to check whether the file is tracked, but it didn't improve anything:

  const command = `(git ls-files --error-unmatch "${file}" || exit 42) && git log ${args} -- "${file}"`;

slorber avatar May 30 '25 15:05 slorber

😢 this also affects GitHub Actions: https://github.com/facebook/docusaurus/actions/runs/15351301297/job/43199950972?pr=11218

I tried running the exact same CI commands locally and didn't encounter this problem.

slorber avatar May 30 '25 16:05 slorber

Apparently someone else encountered that problem: https://github.com/facebook/docusaurus/discussions/11398

slorber avatar Sep 08 '25 13:09 slorber

Have you looked into using a Git library like https://github.com/nodegit/nodegit or https://github.com/isomorphic-git/isomorphic-git ? Might conceivably be faster than spawning a bunch of processes.

markerikson avatar Sep 10 '25 15:09 markerikson

I agree, no need to spawn processes here.

Josh-Cena avatar Sep 10 '25 15:09 Josh-Cena

Yes, those libraries are listed in the issue description already as solutions to consider.

From what I remember:

  • Nodegit doesn't seem to support Node >= 18, so even if it works, we'd need to fix that as well
  • I couldn't get simple-git to work
  • I didn't try isomorphic-git, but I wonder if it would be as fast as native Git

Definitely agree that we shouldn't spawn one process per file. Retrieving the full Git history with timestamps/authors is relatively fast, even for a repo like ours with 10000+ commits.

But we also need to take into account:

  • the modular nature of Docusaurus, which may need to call the getLastUpdate() function in various plugins and plugin steps (we can't read everything upfront globally)
  • the fact that our users have esoteric setups involving git submodules or many independent git repositories
  • the fact that some files may be untracked

slorber avatar Sep 10 '25 16:09 slorber

@slorber not sure if you've found other solutions, I found this is much faster on my machine:

#!/bin/bash
# Get the last updated time of files in a directory based on Git history
git log --name-only --follow --format="COMMIT %h %ci" -- "$1" | awk '
  /^COMMIT/ {commit=$2; date=$3 " " $4 " " $5}
  !/^COMMIT/ && !/^$/ {files[$0]=date}
  END {for (file in files) print file ": " files[file]}
'

You may batch reads to optimize performance.

fuma-nama avatar Sep 14 '25 11:09 fuma-nama

FYI, here's a follow-up issue where we'll see if the e18e community could be interested in helping us solve this perf bottleneck efficiently, once and for all: https://github.com/e18e/ecosystem-issues/issues/216

slorber avatar Oct 24 '25 12:10 slorber