
Feat: Support LLMs for i18n Translation + remove CrowdIn ads

Open · ardigan6 opened this issue 6 months ago · 10 comments

Have you read the Contributing Guidelines on issues?

Description

Add i18n auto-translation support using LLMs via a flexible API.

Has this been requested on Canny?

No response

Motivation

Modern LLMs are better technical translators than the average CrowdIn human.

It would make sense to fully automate the translation flow in CI, and possibly deprecate CrowdIn ~support~ mentions in the future.

API design

docusaurus --translation-source-lang=en --translation-target-langs=fr,de --translation-llm-endpoint=[openai-compatible router]

Have you tried building it?

This should be a core feature.

Self-service

  • [ ] I'd be willing to contribute this feature to Docusaurus myself.

ardigan6 · Jun 02 '25 13:06

Docusaurus is based on file system paths.

We only document Crowdin usage for historical reasons, because Docusaurus v1 sites used it in 2017, and we wanted to help them upgrade to Docusaurus v2 (including ours and other sites at Meta). But we don't have a tight coupling to Crowdin, nor do we have any integration code for it. You don't have to use Crowdin, and we don't plan to deprecate the integration we don't even have.

The community can build a CLI to automatically translate MDX content. The only requirement is to respect the file system conventions we have.
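
Concretely, that mostly means writing translated copies to the standard i18n paths. A minimal sketch of the mapping, assuming the default, unversioned docs plugin layout (`i18n/<locale>/docusaurus-plugin-content-docs/current/...`); versioned docs and other content plugins follow analogous paths:

```ts
// Minimal sketch of the file system convention a translation CLI would need to
// respect: translated docs are plain copies under the locale folder of the
// docs plugin. Assumes the default, unversioned docs setup ("current").
import path from "node:path";

function translatedDocPath(siteDir: string, locale: string, sourceDocPath: string): string {
  // e.g. docs/guides/intro.mdx -> i18n/fr/docusaurus-plugin-content-docs/current/guides/intro.mdx
  const relativeToDocs = path.relative(path.join(siteDir, "docs"), sourceDocPath);
  return path.join(
    siteDir,
    "i18n",
    locale,
    "docusaurus-plugin-content-docs",
    "current",
    relativeToDocs
  );
}
```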

We could be interested in having this as a first-class Docusaurus feature, but I'd rather see a successful community implementation first, so that we understand what works and what doesn't.

slorber · Jun 02 '25 13:06

You don't have to use Crowdin, and we don't plan to deprecate the integration we don't even have.

Yes, that was a bit unclear. I was referring to the README mentions, docs tutorial, etc. Hard to imagine >1% of docusaurus users wanting to use CrowdIn in 2025.

There are several LLM translation scripts available, e.g. https://github.com/moonrailgun/docusaurus-i18n and https://github.com/DanRoscigno/translate_changed_files

My version uses a different approach:

  1. parses the MDX,
  2. extracts strings only from heading/table/text blocks (ignoring code blocks, for example),
  3. replaces any code in the string with $CODEVAR[n], and then
  4. chunks those strings into hash-keyed JSON ({"3d653119": "hello"}) for each supported language,
  5. sends each chunk to the LLM endpoint for translation, then
  6. sends the output along with the original English to a second LLM to review it, flag any hashes with errors, and provide an alternative translation for those, then
  7. copies the base files to each language and applies the string replacements on top of them.

This works well, enables granular and efficient caching, and avoids many of the issues with naive approaches that break MDX formatting or embedded code blocks.

Here is a simplified version of this approach for reference: https://github.com/ardigan6/docusaurus-llm-translator
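
For the curious, a rough, trimmed-down sketch of steps 3-4 above (code placeholders plus hash-keyed chunks); the helper names are illustrative, not the actual API of the repo:

```ts
// Rough sketch of steps 3-4: inline code gets swapped for $CODEVAR[n]
// placeholders so the LLM never sees (or mangles) it, and each translatable
// string is keyed by a short content hash so chunks can be cached and
// re-translated independently.
import { createHash } from "node:crypto";

function maskCode(text: string): { masked: string; codeVars: string[] } {
  const codeVars: string[] = [];
  const masked = text.replace(/`[^`]+`/g, (match) => {
    codeVars.push(match);
    return `$CODEVAR[${codeVars.length - 1}]`;
  });
  return { masked, codeVars };
}

const hashKey = (text: string): string =>
  createHash("sha256").update(text).digest("hex").slice(0, 8);

// Build one hash-keyed chunk, e.g. {"3d653119": "hello"}
function buildChunk(strings: string[]): Record<string, string> {
  const chunk: Record<string, string> = {};
  for (const s of strings) chunk[hashKey(s)] = s;
  return chunk;
}
```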

ardigan6 · Jun 02 '25 21:06

Hard to imagine >1% of docusaurus users wanting to use CrowdIn in 2025.

We use it for Meta sites and don't plan to migrate away in the short term.

This solution is not mandatory, so if you don't like it, just don't use it. There's no need to criticize it. Adding automatic LLM translations and removing Crowdin from our docs are totally different topics.


This works well, enables granular and efficient caching, and avoids many of the issues with naive approaches that break MDX formatting or embedded code blocks.

Here is a simplified version of this approach for reference: ardigan6/docusaurus-llm-translator

If you are already able to meet your needs with your own code, what's the value of adding this to Docusaurus core?

slorber · Jun 03 '25 07:06

Hard to imagine >1% of docusaurus users wanting to use CrowdIn in 2025.

We use it for Meta sites and don't plan to migrate away in the short term.

Have you benchmarked this against the latest models? CrowdIn vendor results were uniformly inferior for the languages I speak, and of course much slower than an API call.

If you are already able to meet your needs with your own code, what's the value of adding this to Docusaurus core?

I've seen very few Docusaurus sites in the wild that enable i18n, which is a shame.

This is probably related to how much harder it is to auto-translate the content than in a traditional CMS, where structured storage makes field-level string replacement easy to handle safely.

Building it into core would make it simpler to reuse AST walking rather than my regex strategy, which was quick to write and works in my case but undoubtedly would fail on some docs.
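
For illustration, a minimal sketch of what the AST-walking variant could look like using the standard unified/remark packages rather than regex (for MDX content you'd add remark-mdx as well); this isn't Docusaurus's internal API, just the ecosystem it builds on:

```ts
// Minimal sketch of AST-based string extraction with remark instead of regex.
// Walks the markdown tree and collects text from prose nodes while skipping
// code blocks and inline code entirely.
import { unified } from "unified";
import remarkParse from "remark-parse";
import { visit, SKIP } from "unist-util-visit";
import type { Root, Text } from "mdast";

function extractTranslatableStrings(markdown: string): string[] {
  const tree = unified().use(remarkParse).parse(markdown) as Root;
  const strings: string[] = [];
  visit(tree, (node) => {
    // Never descend into code: fenced blocks and inline code stay untouched.
    if (node.type === "code" || node.type === "inlineCode") return SKIP;
    if (node.type === "text") strings.push((node as Text).value);
  });
  return strings;
}
```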

ardigan6 · Jun 03 '25 12:06

@ardigan6 you can take a look at our setup if you'd like: https://github.com/ClickHouse/clickhouse-docs. We currently translate from English to Chinese, Japanese, and Russian.

Blargian · Jul 12 '25 12:07

https://github.com/ClickHouse/clickhouse-docs

@Blargian MarkdownNodeParser is a nice find, thanks. Forcing static refs (### Supported data types {#supported-data-types} etc) is also a good trick for avoiding URL breakage. Is that automatically generated and inserted somewhere?

ardigan6 · Jul 16 '25 18:07

https://github.com/ClickHouse/clickhouse-docs

@Blargian MarkdownNodeParser is a nice find, thanks. Forcing static refs (### Supported data types {#supported-data-types} etc) is also a good trick for avoiding URL breakage. Is that automatically generated and inserted somewhere?

@ardigan6, we ran a script to insert them across our docs when we first introduced the change, but we don't auto-generate them for new additions to the docs; we just have a markdown lint rule that fails CI if people don't insert them. Adding them automatically would actually be a nice improvement.
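
If anyone wants to try the auto-insertion idea, here's a rough sketch (not our actual script, just an illustration) that appends an explicit ID to any heading that doesn't already have one, using github-slugger for GitHub-style slugs:

```ts
// Rough sketch: append explicit heading IDs ({#like-this}) to markdown
// headings that don't already have one, so anchors stay stable across
// translations. Lines inside fenced code blocks are left alone.
import GithubSlugger from "github-slugger";

export function addExplicitHeadingIds(markdown: string): string {
  const slugger = new GithubSlugger();
  let inFence = false;
  return markdown
    .split("\n")
    .map((line) => {
      if (/^(?:[`]{3}|~{3})/.test(line)) inFence = !inFence;
      if (inFence) return line;
      const match = /^(#{1,6})\s+(.*)$/.exec(line);
      if (!match || /\{#[^}]+\}\s*$/.test(match[2])) return line; // already has an explicit ID
      const text = match[2].trim();
      return `${match[1]} ${text} {#${slugger.slug(text)}}`;
    })
    .join("\n");
}
```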

Deployment is a little messy, btw. We have separate Vercel deploys for each translation, and the web worker takes care of which deploy to route to based on whether someone navigates to /ru, /zh, or /jp. It's not exactly the intended i18n approach, but it works for us.

Blargian · Jul 16 '25 19:07

@ardigan6, we ran a script to insert them across our docs when we first introduced the change, but we don't auto-generate them for new additions to the docs; we just have a markdown lint rule that fails CI if people don't insert them. Adding them automatically would actually be a nice improvement.

Yes, might borrow a few of these ideas for my version :) (https://github.com/ardigan6/docusaurus-llm-translator)

Deployment is a little messy, btw. We have separate Vercel deploys for each translation, and the web worker takes care of which deploy to route to based on whether someone navigates to /ru, /zh, or /jp. It's not exactly the intended i18n approach, but it works for us.

Surprised N deploys for every change was easier for you, what was the goal there?

ardigan6 · Jul 16 '25 22:07

Surprised N deploys for every change was easier for you, what was the goal there?

We generally kick off the translations only once a month, so it's not deploying on every change. We have a lot of pages (1,300+), so deploying four sets of docs at once would increase build time a lot. This way we keep build times short for the English docs, which are frequently updated, and just accept some delay for the translations as a tradeoff.

The translation script only translates files with changes in any case, so we can always rerun it ad hoc if there is newly added content we need translated sooner than the monthly run.

Blargian · Jul 17 '25 06:07

Great discussion here! I've been following along and wanted to add my perspective, as I've been building a complete end-to-end translation pipeline for Docusaurus based on these same principles.

To validate the approach, I created a framework that takes a Docusaurus GitHub repo and target locales and automatically generates a fully translated site preview on Vercel. For anyone interested in seeing this in action, I'd be happy to provide free alpha testing for community experimentation; please shoot me your GitHub URL!

The points raised by @ardigan6, @slorber, and @Blargian are all spot-on. The challenge has two main parts: the quality of the LLM translation itself, and the technical robustness of the integration. The points below focus on the technical side, based on what I learned in the process; a few rough sketches of these ideas follow the list.

  1. AST-based Parsing and Segmentation:

    • It is much more robust to reuse (or build upon) Docusaurus's internal loaders (based on unified/remark/rehype) for perfect compatibility.
    • The AST allows us to precisely identify and extract only translatable content (paragraphs, list items, table cells, JSX string props) while preserving code blocks, front matter, and MDX syntax.
    • We can also intelligently handle hybrid content, like either extracting comments from code blocks for translation or simply feeding the whole node to a clever LLM with good prompting.
  2. A TMS-inspired Workflow:

    • The model used by Translation Management Systems is still relevant. At its core, this is a centralized Translation Memory (TM): essentially a dictionary that maps source text segments to their translations ({"source_hash": {"locale": "translation"}}). This can be managed per file and/or globally.
    • This TM allows for efficient caching and re-translating only what has changed, forming the backbone of an automated process.
  3. Efficient and Parallel Translation:

    • To process a large number of changes quickly, API calls for translations should be run in parallel.
    • The system must support batching requests and throttling to respect LLM rate limits and control costs effectively.
  4. Quality Assurance and Validation:

    • LLMs can make mistakes that break syntax. The same AST parser used for extraction should be used as a post-translation validator to ensure the reconstructed MDX is still valid.
    • (Optional) A multi-step LLM process can drastically improve quality: a first pass for translation followed by a second "review" or "fix" pass.
  5. Handling Docusaurus Specifics:

    • The pipeline must automatically manage details like updating relative links in translated files and preserving heading IDs to prevent broken links (a great point from the earlier discussion).
  6. Integrated Translation Generation:

    • A new command, perhaps an enhanced docusaurus write-translations, would be responsible for parsing all source MDX and generated JSON files.
    • It would update the global TM by identifying missing translations and then send only those segments to an LLM endpoint. The chunking strategy matters here: it requires balancing context preservation (sending larger, coherent chunks) against cost and performance (smaller, granular requests).
  7. On-Demand Translation Injection:

    • The Docusaurus build process for a specific locale (docusaurus build --locale fr) would remain largely independent. During the build, it would use the TM to inject translations on-the-fly into the content stream before rendering. This avoids needing all translated files physically present.
  8. Scalable Build Strategies:

    • To address build performance on large sites, the approach used by the ClickHouse team (building each locale as a separate, independent deployment) is a proven solution. This model works perfectly with the on-demand injection described above, as each build only needs to pull its specific locale from the TM. This keeps build times and memory usage manageable.
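
To make a couple of these points concrete, here is a rough sketch of the TM shape from point 2 and the "only translate what's missing" selection from point 6. The types and helper names are illustrative, not an existing API:

```ts
// Rough sketch of the Translation Memory shape (point 2) and the
// "only translate what's missing" selection (point 6).
import { createHash } from "node:crypto";

type Locale = string;
// {"source_hash": {"locale": "translation"}}
type TranslationMemory = Record<string, Record<Locale, string>>;

const hashKey = (s: string) => createHash("sha256").update(s).digest("hex").slice(0, 12);

// Given freshly extracted source segments, return only the ones that still
// need a translation for this locale (new or changed source text).
function missingSegments(tm: TranslationMemory, segments: string[], locale: Locale): Map<string, string> {
  const missing = new Map<string, string>(); // hash -> source text
  for (const segment of segments) {
    const key = hashKey(segment);
    if (!tm[key]?.[locale]) missing.set(key, segment);
  }
  return missing;
}

// Merge LLM output (keyed by the same hashes) back into the TM.
function updateTM(tm: TranslationMemory, locale: Locale, translated: Record<string, string>): void {
  for (const [key, text] of Object.entries(translated)) {
    (tm[key] ??= {})[locale] = text;
  }
}
```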
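
And a minimal sketch of point 3, capping concurrency so large change sets stay within rate limits; translateChunk is a stand-in for whatever OpenAI-compatible call is used:

```ts
// Rough sketch of point 3: translate hash-keyed chunks in parallel with a
// concurrency cap so large change sets stay within LLM rate limits.
type Chunk = Record<string, string>;

async function translateAll(
  chunks: Chunk[],
  locale: string,
  translateChunk: (chunk: Chunk, locale: string) => Promise<Chunk>,
  concurrency = 4
): Promise<Chunk[]> {
  const results: Chunk[] = new Array(chunks.length);
  let next = 0;
  // Simple worker pool: each worker repeatedly claims the next unprocessed index.
  const workers = Array.from({ length: Math.min(concurrency, chunks.length) }, async () => {
    while (next < chunks.length) {
      const i = next++;
      results[i] = await translateChunk(chunks[i], locale);
    }
  });
  await Promise.all(workers);
  return results;
}
```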
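
Finally, a small sketch of the post-translation validation from point 4, assuming @mdx-js/mdx (the compiler family Docusaurus itself builds on) is available to re-compile the reconstructed file:

```ts
// Rough sketch of point 4: after re-injecting translations, re-compile the
// result as MDX so syntax the LLM broke is caught before writing to disk.
import { compile } from "@mdx-js/mdx";

async function assertValidMdx(translatedMdx: string, filePath: string): Promise<void> {
  try {
    await compile(translatedMdx);
  } catch (err) {
    throw new Error(`LLM output broke MDX syntax in ${filePath}: ${(err as Error).message}`);
  }
}
```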

philodoxos · Aug 05 '25 09:08