feat(translations/differences): Visualization of how far behind the latest commit of content is the translated-content commit

Open hochan222 opened this issue 1 year ago • 13 comments

Summary

Visualize how far the translated-content commit is behind the latest commit of content.

@mdn/localization-team-leads @queengooborg I'd like your help with the following two points. Let me know what you think.

  1. I had to use execSync to run the git command, which introduces delays. Is there a good workaround? https://github.com/mdn/yari/pull/8338#discussion_r1125502730
  2. I want to show a traffic-light color based on the number of commits behind. How should I set the thresholds? Or is that a bad idea? This extends what was said in the following discussion. https://github.com/mdn/mdn-community/discussions/333
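
To make the second question concrete, here is a minimal sketch of a traffic-light mapping. The threshold values and the names `THRESHOLDS` and `trafficLight` are purely hypothetical; nothing in this thread settles on actual numbers.

```javascript
// Hypothetical thresholds: the discussion has not agreed on any values.
const THRESHOLDS = { green: 0, orange: 5 };

// Map "commits behind" to a traffic-light color.
// null/undefined means the page has no l10n.sourceCommit recorded.
function trafficLight(commitsBehind) {
  if (commitsBehind == null) return "unknown";
  if (commitsBehind <= THRESHOLDS.green) return "green";
  if (commitsBehind <= THRESHOLDS.orange) return "orange";
  return "red";
}

console.log(trafficLight(3)); // orange
```

Keeping the thresholds in one object would make it easy to tune them once localization teams agree on a standard.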

Problem

In translated-content, l10n.sourceCommit is recorded as metadata, but there is nowhere in the dashboard to check it.

---
title: TypedArray.prototype.entries()
slug: Web/JavaScript/Reference/Global_Objects/TypedArray/entries
l10n:
  sourceCommit: 2eb202adbe3d83292500ed46344d63fbbae410b5
---

{{JSRef}}

**`entries()`** 메서드는 해당 배열의 각 인덱스에 대한 키/값 쌍을 포함하는 새로운 {{jsxref("Array", "배열")}} 반복자 객체를 반환합니다.

TODO

  • [x] Optimizing the time taken by the git log command
    • AS IS
      • 6840 files (1825 with a source commit): takes 930 seconds.
      • 1 file, 690 commits: takes 0s~1s.
    • TO BE
      • ~6840 files (with source commit 1825 files): within 120 seconds.~

Solution

Add a source commit element to the _translations/differences page.

On the dashboard page, the source commit element shows how far each translated page is behind the latest commit of the corresponding content page.


Screenshots

  • l10n.sourceCommit exist image

  • l10n.sourceCommit not exist image

Before

No source commit element.

After

Added source commit element.

Test

  • [x] 43/43 files checked with the sort (and reverse sort) filter on the dashboard in the ko locale.
image

Optimization

  • [x] On reload, the cache is used, so the git log command is not executed.
  • [x] Use git rev-list --count ${commitHash}..HEAD -- ${filename}
  • [x] source-commit.json file
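
As an illustration of the `git rev-list --count` item above, here is a minimal sketch. The helper names and the `contentRoot` parameter are assumptions, not the PR's actual code. It uses `execFileSync` with an argument array instead of a shell string so that file names are not interpreted by a shell.

```javascript
import { execFileSync } from "node:child_process";

// Build the arguments for:
//   git rev-list --count <commitHash>..HEAD -- <filename>
function revListArgs(commitHash, filename) {
  return ["rev-list", "--count", `${commitHash}..HEAD`, "--", filename];
}

// Parse the single integer that `git rev-list --count` prints.
function parseCount(stdout) {
  const n = Number.parseInt(stdout.trim(), 10);
  if (Number.isNaN(n)) throw new Error(`unexpected git output: ${stdout}`);
  return n;
}

// Number of commits that touched `filename` after `commitHash`.
// `contentRoot` (assumption): path to a local clone of the content repo.
function commitsBehind(contentRoot, commitHash, filename) {
  const stdout = execFileSync("git", revListArgs(commitHash, filename), {
    cwd: contentRoot,
    encoding: "utf8",
  });
  return parseCount(stdout);
}
```

A call such as `commitsBehind("/path/to/content", hash, "files/en-us/web/css/gap/index.md")` still spawns one git process per file, which is the cost discussed later in this thread.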

hochan222 avatar Mar 04 '23 17:03 hochan222

Thanks a ton @hochan222 <3 I've wanted this for ages but never had time to implement it.

SphinxKnight avatar Mar 06 '23 10:03 SphinxKnight

I tried using workers to pre-process the data, saving the commitHashCache to a file and loading it in the getCommitBehindFromLatest function.

As a result, I realized that storing hashes for every file in the content repo is a bad idea. There are over 11000 markdown files in the content repo. The code worked, but my computer locked up and shut down automatically. (We could solve this by splitting the file, but I'm not sure it's a good approach.)

The PR's current solution is also problematic. In the Japanese locale, about 1800 files have the sourceCommit metadata. I prevented the git log command from running on reload through https://github.com/mdn/yari/pull/8338/commits/36e4628d6b1ff24c1952011b9319fd5340f51b2b, but the initial load still takes 930 seconds (15m 30s). For the roughly 10000 files in the content repo, the current solution would take about 1.3 hours.

Any other good ideas?

The idea of storing hashes for all files in the content repo:

Games, glossary, learn, mdn, mozilla, related, and webassembly are fine, but the web directory makes it blow up.

// worker.js
import { parentPort } from "node:worker_threads";
import { execSync } from "node:child_process";

const CONTENT_ROOT = "/path/to/content/files/en-us/webassembly";

// games
// glossary
// learn
// mdn
// mozilla
// related
// web
// webassembly

parentPort.on("message", (filepath) => {
  console.log(`worker ${filepath}...`);
  const commitHashes = execSync(`git log --pretty=format:%H -- ${filepath}`, {
    cwd: CONTENT_ROOT,
  })
    .toString()
    .split("\n");
  parentPort.postMessage({ filepath, commitHashes, done: true });
  parentPort.close();
});

// mainThread.js
import fs from "node:fs";
import path from "node:path";
import { Worker } from "node:worker_threads";

const CONTENT_ROOT = "/path/to/content/files/en-us/webassembly";
const CACHE_FILE_PATH = "./commitHashCache.json";

let commitHashCache = {};

// Check if the cache file exists, and if so, load the cache data from the file
if (fs.existsSync(CACHE_FILE_PATH)) {
  const cacheData = fs.readFileSync(CACHE_FILE_PATH, "utf8");
  commitHashCache = JSON.parse(cacheData);
}

function saveCommitHashCacheToFile() {
  const cacheData = JSON.stringify(commitHashCache);
  fs.writeFileSync(CACHE_FILE_PATH, cacheData, "utf8");
}

async function cacheAllFiles(folder) {
  const files = fs.readdirSync(folder);
  const promises = [];

  for (const file of files) {
    const filepath = path.join(folder, file);
    const stats = fs.statSync(filepath);

    if (stats.isDirectory()) {
      promises.push(cacheAllFiles(filepath));
    } 
    
    if (stats.isFile()) {
      if (path.extname(file) !== '.md') continue;

      if (!commitHashCache[filepath]) {
        const promise = new Promise((resolve, reject) => {
          const worker = new Worker('./worker.js', { workerData: filepath });

          worker.once('message', ({ filepath, commitHashes }) => {
            commitHashCache[filepath] = commitHashes;
            resolve();
          });
          worker.once('error', reject);
          worker.postMessage(filepath);
        });

        promises.push(promise);
      }
    }
  }

  await Promise.all(promises);
}

async function main() {
  await cacheAllFiles(CONTENT_ROOT);
  saveCommitHashCacheToFile();
}

main().catch(console.error);

hochan222 avatar Mar 06 '23 16:03 hochan222

Some idea about source commit hash cache:

  • Only store the raw hash data (that is, convert the hash string from hex to binary), which halves the memory usage.
  • Do not cache every source commit hash. If hash_A needs to be compared, cache only the commits around it; if an earlier commit is needed by a later call, extend the cache deeper.

But I'm not really sure about the second one.
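
A minimal sketch of the first idea, assuming 40-character SHA-1 hashes; the helper names are hypothetical. It only shows the hex-to-binary round trip, not how a cache would store the buffers.

```javascript
// Convert a 40-character hex SHA-1 into its 20 raw bytes.
function hashToBytes(hex) {
  if (!/^[0-9a-f]{40}$/i.test(hex)) {
    throw new Error(`not a SHA-1 hash: ${hex}`);
  }
  return Buffer.from(hex, "hex");
}

// Convert the raw bytes back into a lowercase hex string.
function bytesToHash(bytes) {
  return bytes.toString("hex");
}

// Sample hash taken from the front matter in the issue description.
const hex = "2eb202adbe3d83292500ed46344d63fbbae410b5";
const raw = hashToBytes(hex);
console.log(hex.length, raw.length); // 40 20
```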

yin1999 avatar Mar 06 '23 23:03 yin1999

This pull request has merge conflicts that must be resolved before it can be merged.

github-actions[bot] avatar Mar 11 '23 14:03 github-actions[bot]

image

🎉🎉 Now the source-commit.json file only needs to be created once, and the ja locale page renders in 18 seconds even after a server restart. (While the server is running, results are cached, so a reload renders within 4 seconds.)

  • not saved in source-commit.json: 6840 files (1825 with a source commit): takes 214 seconds (3m 34s).
  • saved in source-commit.json: 6840 files (1825 with a source commit): takes 18 seconds.

It seems ready now. The first run is long, but creating the file once may be enough.

source-commit.json file

// source-commit.json
{
    "ko/glossary/accessibility": 1,
    "ko/glossary/style_origin": 1,
    "ko/web/javascript": 0,
    "ko/web/security": 3,
    "ko/web/http/csp": 1,
    "ko/web/css/gap": 1,
    ...
}
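
For illustration, a file-backed cache with the shape shown above (page key mapped to commits behind) could be sketched like this. The function names are hypothetical and this is not the PR's implementation; in particular it does not handle invalidation when the content repo moves forward.

```javascript
import fs from "node:fs";

// Load the cache from disk, or start empty if the file does not exist yet.
function loadCache(file) {
  if (!fs.existsSync(file)) return {};
  return JSON.parse(fs.readFileSync(file, "utf8"));
}

// Return the cached value for `key`, computing and storing it on a miss.
// `compute` stands in for the expensive git lookup.
function getOrCompute(cache, key, compute) {
  if (!Object.hasOwn(cache, key)) cache[key] = compute(key);
  return cache[key];
}

// Persist the cache so the next server start can skip the git lookups.
function saveCache(file, cache) {
  fs.writeFileSync(file, JSON.stringify(cache, null, 2), "utf8");
}
```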

source-commit-report.json file

Entries are not checked for duplicates; they are appended to the end of the file each time the server is restarted.

An entry is recorded when the hash stored in the metadata is invalid (e.g. the hash does not exist in the history of the content file, which should not normally happen).

// source-commit-report.txt
ja/web/api/pointerevent/tangentialpressure: 708baf34eabb75789bcd3314a6879da3702024d1
ja/web/api/workerglobalscope/languagechange_event: 0fe2d3ee23b1b23be971d42c7c56729bd23a3f83
ja/web/api/pointerevent/getcoalescedevents: 708baf34eabb75789bcd3314a6879da3702024d1
...

Visualization

image image

hochan222 avatar Mar 11 '23 14:03 hochan222

Further work

I hope the source commit information from this PR can be shown in the card in the image below.

Even without detail, showing an approximate status with traffic-light colors such as red, orange, and green (or blue) would help people who read MDN trust the translated pages. It could also funnel potential contributors toward page contributions. https://github.com/orgs/mdn/discussions/333

Let me know what you think.

image

To do that, I think I need a way to provide the source-commit.json file to AWS in the same way as the popularities.json file. Is there anyone who can help me?

hochan222 avatar Mar 11 '23 15:03 hochan222

@caugner Hello. The PR is ready but still open and pending, so I'm mentioning you. Could you please assign a reviewer to it? Thank you :)

hochan222 avatar Mar 20 '23 15:03 hochan222

This is nice! My review of some specific changes to make is incoming, but I've come up with a proof of concept for a much faster way of calculating these numbers:

On my machine, with the current approach:

Find all translated documents (ja): 3:26.632 (m:ss.mmm)

With this new approach:

Find all translated documents (ja): 22.895s

It's such a massive speed-up I'm not sure I entirely believe it, but I double checked I wasn't returning values from a cache or whatever, and I'm quite confident. Please verify it yourself though!

The key problem with the current approach is that, by running a git command for each file, we keep traversing the same section of the commit graph, and git has to load it from disk each time.

Instead if we invert the process a little bit, and load the files changed in each commit from git, storing that in memory, we can then do that repeated graph traversal very quickly, because we load it from memory each time. We obviously don't want to load the entire commit graph into memory, but we can keep expanding the graph loaded into memory whenever we hit a commit that isn't in it.

Have a look at my proof of concept code in: https://github.com/leomca/yari/commit/a0f258c016208bfd6fe4a8ab9e3103bd77c85df4

It needs a bit of cleanup - for a start, it fully skips the cache so development was easier. With the cache invalidation I suggest in my review comments, we probably want to add that back. And it could possibly do with some better variable names - so feel free to modify it as much as you want :)
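
To illustrate the inversion described above (this is not the linked proof of concept, only a sketch with hypothetical names), the history can be read once with `git log --name-only` and then queried from memory. It assumes a mostly linear history, so counts may differ from `git rev-list` around merges, and it assumes no tracked file path is itself a bare 40-character hex string.

```javascript
import { execFileSync } from "node:child_process";

// Load the newest `maxCount` commits together with the files each changed.
function loadHistory(contentRoot, maxCount) {
  const stdout = execFileSync(
    "git",
    ["log", `--max-count=${maxCount}`, "--name-only", "--pretty=format:%H"],
    { cwd: contentRoot, encoding: "utf8", maxBuffer: 256 * 1024 * 1024 }
  );
  return parseLog(stdout);
}

// Turn the log text into [{ hash, files: Set }], newest commit first.
function parseLog(stdout) {
  const history = [];
  for (const line of stdout.split("\n")) {
    if (/^[0-9a-f]{40}$/.test(line)) {
      history.push({ hash: line, files: new Set() });
    } else if (line !== "" && history.length > 0) {
      history.at(-1).files.add(line);
    }
  }
  return history;
}

// Count commits newer than `sourceCommit` that changed `file`.
// Returns null when `sourceCommit` is outside the loaded window, which is
// the signal to call loadHistory again with a larger maxCount.
function commitsBehind(history, sourceCommit, file) {
  let count = 0;
  for (const { hash, files } of history) {
    if (hash === sourceCommit) return count;
    if (files.has(file)) count += 1;
  }
  return null;
}
```

Every per-file query then walks an in-memory array instead of spawning a git process, and the loaded window only grows when a source commit older than it is requested.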

LeoMcA avatar Mar 23 '23 18:03 LeoMcA

@LeoMcA I'm really sorry for the late reply; I've been busy with work for two weeks. I'm sorry I couldn't respond as quickly as your careful review deserved.

I think the approach you suggested is great. spawn 👍👍 I understand now. Thank you.

I took log measurements on three laptops; details are at the bottom. I was worried about memory, but judging from the logs it seems to be fine.

You have the experience using this dashboard, so I'll let you decide how irritating the various tradeoffs (long startup, not being able to invalidate the cache by simply restarting the server, etc.) would be and decide which option to take.

On a low-performance laptop (Surface Laptop2), the ja locale took 1m 10s longer when the source-commit.json file did not exist. I want to keep the source-commit.json file because some contributors may have slow laptops. So I chose the third of the options you gave. (https://github.com/mdn/yari/pull/8338/commits/afcee51449565915cbb2faf67fe30ed1b05465ad)

Performance by laptop

| Laptop | Locale | Server start (source-commit.json does not exist) | Server restart (source-commit.json exists) | Reload |
| --- | --- | --- | --- | --- |
| MacBook Pro | ja | 19.416s | 14.861s | 1.389s |
| MacBook Pro | ko | 6.685s | 7.179s | 1.274s |
| MacBook Air | ja | 11.983s | 12.128s | 272.352ms |
| MacBook Air | ko | 4.106s | 3.867s | 242.033ms |
| Surface Laptop2 | ja | 1:41.932 (m:ss.mmm) | 31.946s | 3.377s |
| Surface Laptop2 | ko | 31.143s | 18.694s | 2.752s |

I measured the following three laptops.

  • MacBook Pro(M1, 16GB, 2021 model)
  • MacBook Air(M2, 8GB, 2022 model)
  • Surface Laptop2 (Intel® Core™ 8-i7, 8GB)

(1) The MacBook Pro is a company laptop, so I measured with a few programs running.
(2) All measurements were taken without removing the rss console.log calls.
(3) For "Server start", three measurements were taken and averaged.
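
For reference, rss lines in the shape seen in the logs can be produced with a small wrapper like the following. This is only a sketch with hypothetical helper names; where the PR actually placed its logging is not shown in this thread.

```javascript
// Run `task`, logging the resident set size before and after it.
async function withRssLog(task) {
  const before = process.memoryUsage().rss;
  console.log(`rss before: ${before} bytes`);
  const result = await task();
  const after = process.memoryUsage().rss;
  console.log(`rss after: ${after} bytes`);
  console.log(`rss diff: ${after - before} bytes`);
  return result;
}

// console.time/timeEnd print "<label>: <elapsed>" when the task finishes.
async function timed(label, task) {
  console.time(label);
  try {
    return await withRssLog(task);
  } finally {
    console.timeEnd(label);
  }
}
```

A negative diff only means the process held less resident memory afterwards, for example because garbage collection ran during the task.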

rss log

MacBook Pro(M1, 16GB, 2021 model)

### server start (source-commit.json file not exist) 

(1)
오후 9:32:33 server.1     |  rss before: 262586368 bytes
오후 9:32:37 server.1     |  rss after: 256000000 bytes
오후 9:32:37 server.1     |  rss diff: -6586368 bytes
오후 9:32:39 server.1     |  rss before: 260685824 bytes
오후 9:32:39 server.1     |  rss after: 260784128 bytes
오후 9:32:39 server.1     |  rss diff: 98304 bytes
오후 9:32:48 server.1     |  rss before: 278544384 bytes
오후 9:32:48 server.1     |  rss after: 244842496 bytes
오후 9:32:48 server.1     |  rss diff: -33701888 bytes
오후 9:32:52 server.1     |  Find all translated documents (ja): 20.461s
오후 9:33:03 server.1     |  Find all translated documents (ko): 6.784s

(2)
오후 9:43:41 server.1     |  rss before: 262602752 bytes
오후 9:43:44 server.1     |  rss after: 284033024 bytes
오후 9:43:44 server.1     |  rss diff: 21430272 bytes
오후 9:43:46 server.1     |  rss before: 285048832 bytes
오후 9:43:46 server.1     |  rss after: 285065216 bytes
오후 9:43:46 server.1     |  rss diff: 16384 bytes
오후 9:43:53 server.1     |  rss before: 308019200 bytes
오후 9:43:53 server.1     |  rss after: 266960896 bytes
오후 9:43:53 server.1     |  rss diff: -41058304 bytes
오후 9:43:59 server.1     |  Find all translated documents (ja): 19.008s
오후 9:44:51 server.1     |  Find all translated documents (ko): 6.601s

(3)
오후 9:48:41 server.1     |  rss before: 258359296 bytes
오후 9:48:44 server.1     |  rss after: 289980416 bytes
오후 9:48:44 server.1     |  rss diff: 31621120 bytes
오후 9:48:46 server.1     |  rss before: 283033600 bytes
오후 9:48:46 server.1     |  rss after: 283082752 bytes
오후 9:48:46 server.1     |  rss diff: 49152 bytes
오후 9:48:53 server.1     |  rss before: 305774592 bytes
오후 9:48:53 server.1     |  rss after: 264060928 bytes
오후 9:48:53 server.1     |  rss diff: -41713664 bytes
오후 9:48:58 server.1     |  Find all translated documents (ja): 18.781s
오후 9:49:29 server.1     |  Find all translated documents (ko): 6.671s

### server restart (source-commit.json file exist) 

오전 1:43:41 server.1     |  Find all translated documents (ja): 14.861s
오전 1:43:49 server.1     |  Find all translated documents (ko): 7.179s

### reload

오후 10:09:03 server.1    |  Find all translated documents (ja): 1.389s
오후 10:08:35 server.1    |  Find all translated documents (ko): 1.274s

MacBook Air(M2, 8GB, 2022 model)

### server start (source-commit.json file not exist) 

(1)
오후 9:16:15 server.1     |  rss before: 230408192 bytes
오후 9:16:18 server.1     |  rss after: 251691008 bytes
오후 9:16:18 server.1     |  rss diff: 21282816 bytes
오후 9:16:19 server.1     |  rss before: 259538944 bytes
오후 9:16:19 server.1     |  rss after: 259637248 bytes
오후 9:16:19 server.1     |  rss diff: 98304 bytes
오후 9:16:23 server.1     |  rss before: 281657344 bytes
오후 9:16:23 server.1     |  rss after: 281690112 bytes
오후 9:16:23 server.1     |  rss diff: 32768 bytes
오후 9:16:26 server.1     |  Find all translated documents (ja): 12.284s
오후 9:17:23 server.1     |  Find all translated documents (ko): 4.172s

(2)
오후 9:43:25 server.1     |  rss before: 228950016 bytes
오후 9:43:28 server.1     |  rss after: 248283136 bytes
오후 9:43:28 server.1     |  rss diff: 19333120 bytes
오후 9:43:29 server.1     |  rss before: 254115840 bytes
오후 9:43:29 server.1     |  rss after: 254197760 bytes
오후 9:43:29 server.1     |  rss diff: 81920 bytes
오후 9:43:34 server.1     |  rss before: 277807104 bytes
오후 9:43:34 server.1     |  rss after: 277839872 bytes
오후 9:43:34 server.1     |  rss diff: 32768 bytes
오후 9:43:36 server.1     |  Find all translated documents (ja): 12.200s
오후 9:44:54 server.1     |  Find all translated documents (ko): 4.227s

(3)
오후 9:48:37 server.1     |  rss before: 227262464 bytes
오후 9:48:39 server.1     |  rss after: 256802816 bytes
오후 9:48:39 server.1     |  rss diff: 29540352 bytes
오후 9:48:40 server.1     |  rss before: 258621440 bytes
오후 9:48:40 server.1     |  rss after: 258686976 bytes
오후 9:48:40 server.1     |  rss diff: 65536 bytes
오후 9:48:45 server.1     |  rss before: 280870912 bytes
오후 9:48:45 server.1     |  rss after: 280887296 bytes
오후 9:48:45 server.1     |  rss diff: 16384 bytes
오후 9:48:48 server.1     |  Find all translated documents (ja): 11.466s
오후 9:49:21 server.1     |  Find all translated documents (ko): 3.920s

### server restart (source-commit.json file exist) 

오후 10:06:29 server.1    |  Find all translated documents (ja): 12.128s
오후 10:06:39 server.1    |  Find all translated documents (ko): 3.867s

### reload

오후 10:06:31 server.1    |  Find all translated documents (ja): 272.352ms
오후 10:06:54 server.1    |  Find all translated documents (ko): 242.033ms

Surface Laptop2 (Intel® Core™ 8-i7, 8GB)

### server start (source-commit.json file not exist) 

(1)
오후 9:24:53 server.1     |  rss before: 261705728 bytes
오후 9:25:12 server.1     |  rss after: 274558976 bytes
오후 9:25:12 server.1     |  rss diff: 12853248 bytes
오후 9:25:21 server.1     |  rss before: 273227776 bytes
오후 9:25:21 server.1     |  rss after: 272945152 bytes
오후 9:25:21 server.1     |  rss diff: -282624 bytes
오후 9:25:56 server.1     |  rss before: 303996928 bytes
오후 9:25:56 server.1     |  rss after: 303996928 bytes
오후 9:25:56 server.1     |  rss diff: 0 bytes
오후 9:26:19 server.1     |  Find all translated documents (ja): 1:31.642 (m:ss.mmm)
오후 9:28:21 server.1     |  Find all translated documents (ko): 31.898s

(2)
오후 9:50:56 server.1     |  rss before: 265187328 bytes
오후 9:51:16 server.1     |  rss after: 277553152 bytes
오후 9:51:16 server.1     |  rss diff: 12365824 bytes
오후 9:51:24 server.1     |  rss before: 284254208 bytes
오후 9:51:24 server.1     |  rss after: 284266496 bytes
오후 9:51:24 server.1     |  rss diff: 12288 bytes
오후 9:52:07 server.1     |  rss before: 309460992 bytes
오후 9:52:07 server.1     |  rss after: 309469184 bytes
오후 9:52:07 server.1     |  rss diff: 8192 bytes
오후 9:52:43 server.1     |  Find all translated documents (ja): 1:52.605 (m:ss.mmm)
오후 9:53:18 server.1     |  Find all translated documents (ko): 30.816s

(3)
오후 10:01:01 server.1    |  rss before: 252461056 bytes
오후 10:01:22 server.1    |  rss after: 283996160 bytes
오후 10:01:22 server.1    |  rss diff: 31535104 bytes
오후 10:01:30 server.1    |  rss before: 288923648 bytes
오후 10:01:30 server.1    |  rss after: 275550208 bytes
오후 10:01:30 server.1    |  rss diff: -13373440 bytes
오후 10:02:12 server.1    |  rss before: 300892160 bytes
오후 10:02:12 server.1    |  rss after: 299499520 bytes
오후 10:02:12 server.1    |  rss diff: -1392640 bytes
오후 10:02:38 server.1    |  Find all translated documents (ja): 1:41.551 (m:ss.mmm)
오후 10:03:25 server.1    |  Find all translated documents (ko): 30.716s

### server restart (source-commit.json file exist) 

오전 1:34:51 server.1     |  Find all translated documents (ja): 31.946s
오전 1:34:19 server.1     |  Find all translated documents (ko): 18.694s

### reload

오후 10:04:09 server.1    |  Find all translated documents (ja): 3.377s
오후 10:04:21 server.1    |  Find all translated documents (ko): 2.752s

hochan222 avatar Apr 22 '23 18:04 hochan222

@LeoMcA is there anything blocking this from getting merged?

(cc @queengooborg working on similar topics)

SphinxKnight avatar Sep 27 '23 11:09 SphinxKnight

@mdn/mdn-community-engagement

SphinxKnight avatar Sep 27 '23 11:09 SphinxKnight

Any progress on this PR?

awxiaoxian2020 avatar Mar 02 '24 01:03 awxiaoxian2020

@LeoMcA Hello.

I've verified that the page loads swiftly enough and is functioning properly.

I believe the current pull request is prepared for merging. Should there be any additional requirements or missing elements for the merge, please let me know🙇🙇

hochan222 avatar Mar 25 '24 16:03 hochan222