
refactor(build/git-history): only collect existing files and add type hints

Open yin1999 opened this issue 10 months ago • 1 comment

Summary

  1. Based on how the git history is used, when a file's commit reached the main branch as a parent of a merge commit, we use the merge commit's hash for that file.

The case in which the merged field would be used:

https://github.com/mdn/translated-content/commit/94fd10f637122772873430b6e2d18d218398dc26 is the parent of the merge commit: https://github.com/mdn/translated-content/commit/400550f16805b2c9ce031c3ff9db5c2aae1dde1f

In this case, the value for the key fr/mdn/contribute/howto/creer_un_exercice_interactif_pour_apprendre_le_web/index.html would have a merged field, and the hash in merged would be 400550f16805b2c9ce031c3ff9db5c2aae1dde1f.

This means we use the merge commit's hash as the actual commit hash of this file.
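
As a rough illustration of the mapping described above, the sketch below models the parent lookup on hand-written data. The `LogEntry` type and the helper names `buildParentMap` and `withMerged` are hypothetical, the dates and the other parent hashes are placeholders, and only the two full commit hashes come from the links above.

```typescript
type Commit = { modified: string; hash: string };
type FileCommit = Commit & { merged?: Commit };
// Hypothetical, simplified model of one `git log` header record.
type LogEntry = Commit & { parents: string[] };

// For every commit with exactly two parents (a merge commit), remember
// which merge commit brought in the second parent.
function buildParentMap(entries: LogEntry[]): Map<string, Commit> {
  const parents = new Map<string, Commit>();
  for (const { hash, modified, parents: parentHashes } of entries) {
    if (parentHashes.length === 2) {
      parents.set(parentHashes[1], { modified, hash });
    }
  }
  return parents;
}

// Attach the merge commit (if any) to the commit recorded for a file.
// Unlike the original code, this omits `merged` instead of setting it
// to undefined when there is no match.
function withMerged(commit: Commit, parents: Map<string, Commit>): FileCommit {
  const merged = parents.get(commit.hash);
  return merged ? { ...commit, merged } : { ...commit };
}

const MERGE = "400550f16805b2c9ce031c3ff9db5c2aae1dde1f";
const PARENT = "94fd10f637122772873430b6e2d18d218398dc26";

// Placeholder dates and placeholder "other parent" hashes.
const entries: LogEntry[] = [
  { hash: PARENT, modified: "2000-01-01T00:00:00.000Z", parents: ["b".repeat(40)] },
  { hash: MERGE, modified: "2000-01-02T00:00:00.000Z", parents: ["a".repeat(40), PARENT] },
];

const result = withMerged(
  { hash: PARENT, modified: entries[0].modified },
  buildParentMap(entries)
);
console.log(result.merged?.hash === MERGE); // true
```

The file keeps its own commit hash, and the merge commit only appears in the extra `merged` field, which is why two hashes end up being stored for one file.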

To gather the git history for such a case, the following code can be used:

git-history.ts used to gather the git history
// git-history.ts
import fs from "node:fs";
import path from "node:path";

import { execGit } from "../content/index.js";
import { CONTENT_ROOT } from "../libs/env/index.js";

function getFromGit(contentRoot = CONTENT_ROOT) {
  // If `contentRoot` was a symlink, the `repoRoot` won't be. That'll make it
  // impossible to compute the relative path for files within when we get
  // output back from `git log ...`.
  // So, always normalize to the real path.
  const realContentRoot = fs.realpathSync(contentRoot);

  const repoRoot = execGit(["rev-parse", "--show-toplevel"], {
    cwd: realContentRoot,
  });

  const MARKER = "COMMIT:";
  const DELIMITER = "_";
  const output = execGit(
    [
      "log",
      "--name-only",
      "--no-decorate",
      `--format=${MARKER}%H${DELIMITER}%cI${DELIMITER}%P`,
      "--date-order",
      "--reverse",
      // "Separate the commits with NULs instead of with new newlines."
      // So each line isn't, possibly, wrapped in "quotation marks".
      // Now we just need to split the output, as a string, by \0.
      "-z",
    ],
    {
      cwd: repoRoot,
    },
    repoRoot
  );

  const map = new Map();
  let date = null;
  let hash = null;
  // Even if we specified the `-z` option to `git log ...` above, sometimes
  // it seems `git log` prefers to use a newline character.
  // At least as of git version 2.28.0 (Dec 2020). So let's split on both
  // characters to be safe.
  const parents = new Map();
  for (const line of output.split(/\0|\n/)) {
    if (line.startsWith(MARKER)) {
      const data = line.replace(MARKER, "").split(DELIMITER);
      hash = data[0];
      date = new Date(data[1]);
      if (data[2]) {
        const split = data[2].split(" ");
        if (split.length === 2) {
          const [, parentHash] = split;
          parents.set(parentHash, { modified: date, hash });
        }
      }
    } else if (line) {
      const relPath = path.relative(realContentRoot, path.join(repoRoot, line));
      map.set(relPath, { modified: date, hash });
    }
  }
  return [map, parents];
}

export function gather(contentRoots, previousFile = null) {
  const map = new Map();
  if (previousFile) {
    const previous = JSON.parse(fs.readFileSync(previousFile, "utf-8"));
    for (const [key, value] of Object.entries(previous)) {
      map.set(key, value);
    }
  }
  // Every key in this map is a path, relative to root.
  for (const contentRoot of contentRoots) {
    const [commits, parents] = getFromGit(contentRoot);
    for (const [key, value] of commits) {
      // CONTENT_*_ROOT isn't necessarily the git root, so some paths resolve
      // outside it (for example "../README.md"). Those aren't documents, so
      // exclude them.
      // We also only care about documents.
      if (
        !key.startsWith(".") &&
        (key.endsWith("index.html") || key.endsWith("index.md"))
      ) {
        map.set(key, Object.assign(value, { merged: parents.get(value.hash) }));
        if (value.hash === '94fd10f637122772873430b6e2d18d218398dc26') {
          console.log(key, value, parents.get(value.hash));
          return map;
        }
      }
    }
  }
  return map;
}

In this process, we collected an additional hash value that will not be used.

Solution

git log has a flag called --first-parent, which follows only the first parent commit upon seeing a merge commit. After adding this flag, the above file is associated with https://github.com/mdn/translated-content/commit/400550f16805b2c9ce031c3ff9db5c2aae1dde1f:

image

After adding this flag, we don't need to get the parent mapping before generating the git history.
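
The effect can be modeled without running git. The sketch below parses a simplified, hand-written sample in the `COMMIT:<hash>_<date>` format; it is not captured `git log` output, the dates and the first hash are placeholders, and `parseLog` is a hypothetical stand-in that skips the path normalization done in the real code.

```typescript
type Commit = { modified: string; hash: string };

const MARKER = "COMMIT:";
const DELIMITER = "_";

// Parse header lines and file names; later commits overwrite earlier
// ones, which matches reading the log oldest-first (`--reverse`).
function parseLog(output: string): Map<string, Commit> {
  const map = new Map<string, Commit>();
  let current: Commit | null = null;
  for (const line of output.split(/\0|\n/)) {
    if (line.startsWith(MARKER)) {
      const [hash, date] = line.slice(MARKER.length).split(DELIMITER);
      current = { hash, modified: new Date(date).toISOString() };
    } else if (line && current) {
      map.set(line, { ...current });
    }
  }
  return map;
}

const KEY =
  "fr/mdn/contribute/howto/creer_un_exercice_interactif_pour_apprendre_le_web/index.html";
const MERGE = "400550f16805b2c9ce031c3ff9db5c2aae1dde1f";

// With --first-parent, the side-branch commit is not visited, so the file
// is listed under the merge commit on the main line instead.
const sample = [
  `${MARKER}${"a".repeat(40)}${DELIMITER}2000-01-01T10:00:00+01:00`,
  "",
  "fr/some/other/index.html",
  `${MARKER}${MERGE}${DELIMITER}2000-01-02T10:00:00+01:00`,
  "",
  KEY,
].join("\0");

const history = parseLog(sample);
console.log(history.get(KEY)?.hash === MERGE); // true
```

Because each file maps directly to a commit on the main line, no separate parent lookup is needed in this model.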

With this flag, the git history for this case can be gathered with the following code:

git-history.ts used to gather the git history
import fs from "node:fs";
import path from "node:path";

import { execGit } from "../content/index.js";
import { CONTENT_ROOT } from "../libs/env/index.js";

export type Commit = {
  modified: string; // ISO 8601 format
  hash: string;
};

interface CommitHistory {
  [filePath: string]: Commit;
}

function getFromGit(contentRoot = CONTENT_ROOT) {
  // If `contentRoot` was a symlink, the `repoRoot` won't be. That'll make it
  // impossible to compute the relative path for files within when we get
  // output back from `git log ...`.
  // So, always normalize to the real path.
  const realContentRoot = fs.realpathSync(contentRoot);

  const repoRoot = execGit(["rev-parse", "--show-toplevel"], {
    cwd: realContentRoot,
  });

  const MARKER = "COMMIT:";
  const DELIMITER = "_";
  const output = execGit(
    [
      "log",
      "--name-only",
      "--no-decorate",
      `--format=${MARKER}%H${DELIMITER}%cI`,
      "--date-order",
      "--reverse",
      // Use the merge commit's date, since that is when the content can
      // be built and deployed. This is similar to GitHub's
      // "Squash and merge" option.
      "--first-parent",
      // "Separate the commits with NULs instead of with new newlines."
      // So each line isn't, possibly, wrapped in "quotation marks".
      // Now we just need to split the output, as a string, by \0.
      "-z",
    ],
    {
      cwd: repoRoot,
    },
    repoRoot
  );

  const map = new Map<string, Commit>();
  let date: string = null;
  let hash: string = null;
  // Even if we specified the `-z` option to `git log ...` above, sometimes
  // it seems `git log` prefers to use a newline character.
  // At least as of git version 2.28.0 (Dec 2020). So let's split on both
  // characters to be safe.
  for (const line of output.split(/\0|\n/)) {
    if (line.startsWith(MARKER)) {
      const [hashStr, dateStr] = line.replace(MARKER, "").split(DELIMITER);
      hash = hashStr;
      date = new Date(dateStr).toISOString();
    } else if (line) {
      const relPath = path.relative(realContentRoot, path.join(repoRoot, line));
      map.set(relPath, { modified: date, hash });
      if (hash === '400550f16805b2c9ce031c3ff9db5c2aae1dde1f') {
        return map;
      }
    }
  }
  return map;
}

// Read the git history from the specified file.
// If the file doesn't exist, return an empty object.
export function readGitHistory(historyFilePath: string): CommitHistory {
  if (fs.existsSync(historyFilePath)) {
    return JSON.parse(fs.readFileSync(historyFilePath, "utf-8"));
  }
  return {};
}

export function gather(contentRoots: string[], previousFile: string = null) {
  const map = new Map<string, Commit>();
  if (previousFile) {
    const previous = readGitHistory(previousFile);
    for (const [key, value] of Object.entries(previous)) {
      map.set(key, value);
    }
  }
  // Every key in this map is a path, relative to root.
  for (const contentRoot of contentRoots) {
    const commits = getFromGit(contentRoot);
    for (const [key, value] of commits) {
      // CONTENT_*_ROOT isn't necessarily the git root, so some paths resolve
      // outside it (for example "../README.md"). Those aren't documents, so
      // exclude them.
      // We also only care about existing documents.
      if (
        !key.startsWith(".") &&
        (key.endsWith("index.html") || key.endsWith("index.md"))
      ) {
        map.set(key, value);
        if (value.hash === '400550f16805b2c9ce031c3ff9db5c2aae1dde1f') {
          return map;
        }
      }
    }
  }
  return map;
}
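
The comment above mentions existing documents. One possible way to apply such a check, assuming that "existing" means the file is present on disk under the content root, is sketched below. `onlyExisting` is a hypothetical helper and is not taken from the change itself; the paths, hashes, and dates are placeholders.

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

type Commit = { modified: string; hash: string };

// Hypothetical helper: keep only entries whose file still exists
// under `contentRoot`.
function onlyExisting(
  commits: Map<string, Commit>,
  contentRoot: string
): Map<string, Commit> {
  const result = new Map<string, Commit>();
  for (const [relPath, commit] of commits) {
    if (fs.existsSync(path.join(contentRoot, relPath))) {
      result.set(relPath, commit);
    }
  }
  return result;
}

// Demo on a throwaway directory with one file present and one missing.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "git-history-"));
fs.mkdirSync(path.join(root, "fr/kept"), { recursive: true });
fs.writeFileSync(path.join(root, "fr/kept/index.md"), "# kept\n");

const commits = new Map<string, Commit>([
  ["fr/kept/index.md", { modified: "2000-01-01T00:00:00.000Z", hash: "a".repeat(40) }],
  ["fr/deleted/index.md", { modified: "2000-01-02T00:00:00.000Z", hash: "b".repeat(40) }],
]);

const kept = onlyExisting(commits, root);
console.log([...kept.keys()]); // only "fr/kept/index.md" remains

fs.rmSync(root, { recursive: true, force: true });
```

Filtering this way would keep deleted or moved documents out of the gathered history, at the cost of one file system check per entry.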

Screenshots

Before

The gathered information of fr/mdn/contribute/howto/creer_un_exercice_interactif_pour_apprendre_le_web/index.html using the code provided in Summary:

image

After

The gathered information of fr/mdn/contribute/howto/creer_un_exercice_interactif_pour_apprendre_le_web/index.html using the code provided in Solution:

image


How did you test this change?

Run yarn tool gather-git-history.

yin1999, Apr 07 '24 14:04