mdBook Add commands for Gettext-based translations i18n

This implements the Gettext-based translation support mentioned in https://github.com/rust-lang/mdBook/issues/5#issuecomment-1187504713. Gettext is a widely used standard for translating software, with many tools available for translators to maintain and update the translations.

I added two new top-level commands:

mdbook xgettext: extracts all strings into a messages.pot file, similar to how xgettext works for source code.
mdbook gettext: uses an xx.po file to generate a translated source tree, similar to how gettext works.

The names don't feel great to me, since they assume that one is already familiar with the Gettext system. Perhaps it would be better to have mdbook i18n extract and mdbook i18n translate or similar?

The translated source tree can be used together with the language support from #1306 to get a multi-lingual book.

Jul 24 '22 21:07 mgeisler

While this seems to work, I marked this as a draft since I'm sure we need some discussion here.

Jul 24 '22 21:07 mgeisler

Hey @sebras, this is the PR I was working on with the extract and reconstruct scripts — they're no longer scripts but now top-level mdbook commands just so that I could hook into the MDBook struct and easily iterate over the book content.

Please let me know how this works for you — I'll also be testing it out here over the next few weeks.

Jul 25 '22 12:07 mgeisler

I'm marking this as non-draft since I would love to get some feedback from people on this.

Aug 09 '22 19:08 mgeisler

Hi @mgeisler, I wanted to test this pull request, but I get an error message from cargo build:

error[E0433]: failed to resolve: could not find `gettext` in `cmd`
  --> src/main.rs:38:48
   |
38 |         Some(("gettext", sub_matches)) => cmd::gettext::execute(sub_matches),
   |                                                ^^^^^^^ could not find `gettext` in `cmd`

error[E0433]: failed to resolve: could not find `gettext` in `cmd`
  --> src/main.rs:82:26
   |
82 |         .subcommand(cmd::gettext::make_subcommand())
   |                          ^^^^^^^ could not find `gettext` in `cmd`

Is there something missing?

Sep 08 '22 10:09 aellwein

> Is there something missing?

Oops! Yes, there is... I had not added a pub mod gettext; line to src/cmd/mod.rs. Thanks for catching that!

I've updated the branch, please give it a try again.
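For anyone hitting the same E0433 error, here is a minimal sketch of why the missing declaration matters. It uses an inline module as a simplified, hypothetical stand-in for src/cmd/mod.rs; the execute signature and the dispatch helper are assumptions for illustration, not the code from this PR.

```rust
// Hypothetical, simplified stand-in for src/cmd/mod.rs and its submodule.
mod cmd {
    // Without this declaration, the path `cmd::gettext` does not exist and
    // rustc reports E0433 at every place that uses it.
    pub mod gettext {
        // Placeholder for the real subcommand entry point (assumed signature).
        pub fn execute(po_file: &str) -> String {
            format!("translating with {po_file}")
        }
    }
}

// Dispatch on the subcommand name, loosely modelled on the match in main.rs.
fn dispatch(subcommand: &str, arg: &str) -> Option<String> {
    match subcommand {
        "gettext" => Some(cmd::gettext::execute(arg)),
        _ => None,
    }
}
```

In a real crate the module body lives in its own file, and the `pub mod gettext;` line in the parent module is what makes the compiler load it.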

Sep 08 '22 15:09 mgeisler

@mgeisler, I'm sorry for the delay. It took me some time to test the PR, but first of all, thank you for your work.

I've tried to create some example content. Everything works well, but it was not quite what I expected.

The xgettext command simply converted every line of my chapter into a separate message. This approach seems very tiresome to me: because the whole text is split into lines, it's hard to read, follow the context, and translate afterwards.

In my opinion, gettext makes sense when you expect single messages to be translated out of context (like program info boxes, error messages, buttons, etc.), but when creating a book it's usually the whole text of a chapter that is to be translated (with maybe some small exceptions).

So, at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

So I would like to know what others think about it, and whether the gettext approach is feasible for book writers.

Sep 13 '22 18:09 aellwein

> So, at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

In other projects where I have translated online documentation and websites, each paragraph tends to be separated out into its own gettext translatable message. That gives the translator enough context without being overly long, as entire chapters may be. Moreover, a paragraph per message makes it easier to identify changes per revision; if the message is too long, it may be difficult to identify all differences. Finally, paragraphs may move around unchanged between revisions, and having each paragraph as a gettext message means they would not require retranslation (whereas an entire chapter would).

PS. These are just general observations from the position of a translator; I have not tested this proposed PR yet.
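The point about paragraphs moving around can be sketched in a few lines. This is only an illustration under simple assumptions: the catalog is a plain map from source paragraph to translation, paragraphs are separated by exactly one blank line, and the function name translate is hypothetical.

```rust
use std::collections::HashMap;

/// Translate a document paragraph by paragraph. The catalog is keyed by the
/// source text, not by position, so reordering paragraphs needs no new work.
/// Paragraphs missing from the catalog are left in the source language.
fn translate(source: &str, catalog: &HashMap<&str, &str>) -> String {
    source
        .split("\n\n")
        .map(|paragraph| catalog.get(paragraph).copied().unwrap_or(paragraph))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

With a catalog containing "Hello" and "Goodbye", both "Hello\n\nGoodbye" and "Goodbye\n\nHello" translate fully, while a brand-new paragraph simply passes through untranslated.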

Sep 13 '22 18:09 sebras

> @mgeisler, I'm sorry for the delay. It took me some time to test the PR, but first of all, thank you for your work.

No worries at all, thanks a lot for giving it a go!

> I've tried to create some example content. Everything works well, but it was not quite what I expected.
>
> The xgettext command simply converted every line of my chapter into a separate message. This approach seems very tiresome to me: because the whole text is split into lines, it's hard to read, follow the context, and translate afterwards.

Right, I fully intended to extract paragraphs (lines of text between \n\n+) and not individual lines.

I just tried with cargo run -- xgettext inside the test_book directory of this repository. The resulting messages.pot file looks like this:

#: individual/list.md:1
msgid "# Lists"
msgstr ""

#: individual/list.md:3
msgid ""
"1. A\n"
"2. Normal\n"
"3. Ordered\n"
"4. List"
msgstr ""

#: individual/list.md:8
msgid "---"
msgstr ""

This corresponds to

# Lists

1. A
2. Normal
3. Ordered
4. List

---

I think that's what we both wanted: lines of text are kept together unless they are separated by \n\n+. Do you see something else? Could it perhaps be that you're on Windows? I wrote the code to split on \n only, but I don't see why it could not split on \r\n as well.
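As a rough sketch of the behaviour described above (not the code in this branch), paragraph extraction that tolerates both \n and \r\n could look like the following. The function names are hypothetical, and the POT formatter only escapes backslashes and double quotes.

```rust
/// Split `text` into paragraphs separated by one or more blank lines and
/// return each paragraph with its 1-based starting line number.
fn extract_paragraphs(text: &str) -> Vec<(usize, String)> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut start = 0;
    // `str::lines` accepts both "\n" and "\r\n" line endings.
    for (idx, line) in text.lines().enumerate() {
        if line.is_empty() {
            if !current.is_empty() {
                paragraphs.push((start, current.join("\n")));
                current.clear();
            }
        } else {
            if current.is_empty() {
                start = idx + 1;
            }
            current.push(line);
        }
    }
    if !current.is_empty() {
        paragraphs.push((start, current.join("\n")));
    }
    paragraphs
}

/// Escape the characters that are special inside a PO string literal.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Format one POT entry with an empty msgstr. Multi-line msgids use the
/// conventional form with an empty first string.
fn pot_entry(path: &str, line: usize, msgid: &str) -> String {
    let lines: Vec<&str> = msgid.split('\n').collect();
    let mut out = format!("#: {path}:{line}\n");
    if lines.len() == 1 {
        out.push_str(&format!("msgid \"{}\"\n", escape(lines[0])));
    } else {
        out.push_str("msgid \"\"\n");
        for (i, l) in lines.iter().enumerate() {
            // Every line except the last keeps its newline as a literal \n.
            let newline = if i + 1 < lines.len() { "\\n" } else { "" };
            out.push_str(&format!("\"{}{}\"\n", escape(l), newline));
        }
    }
    out.push_str("msgstr \"\"\n");
    out
}
```

Run on the list.md text above, extract_paragraphs yields three paragraphs starting at lines 1, 3, and 8, matching the source references in the messages.pot excerpt.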

Now, this list example is perhaps a poor example: I've been wondering if it makes sense to parse the Markdown more carefully and emit individual msgids for each list item. Similarly, a heading like ## My heading could be put into the messages.pot file as simply My heading. That way the translators will have less markup to deal with (but also slightly less context).
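To make the idea concrete, here is a small sketch of what "parsing the Markdown more carefully" might mean for headings and lists. It is deliberately naive: it handles only single-line ATX headings and single-line list items, and both function names are hypothetical. A real implementation would more likely build on a proper Markdown parser.

```rust
/// Turn one paragraph into translatable messages with block markup removed.
/// Anything that is not a heading or a simple list is returned unchanged.
fn strip_block_markup(paragraph: &str) -> Vec<String> {
    // ATX heading: a single line with one to six '#' followed by a space.
    let hashes = paragraph.chars().take_while(|&c| c == '#').count();
    if paragraph.lines().count() == 1
        && (1..=6).contains(&hashes)
        && paragraph[hashes..].starts_with(' ')
    {
        return vec![paragraph[hashes..].trim().to_string()];
    }
    // List: every line must be a list item, otherwise keep the paragraph.
    let items: Option<Vec<String>> = paragraph.lines().map(list_item_text).collect();
    match items {
        Some(items) => items,
        None => vec![paragraph.to_string()],
    }
}

/// Return the text of a list item, or `None` if `line` is not a list item.
fn list_item_text(line: &str) -> Option<String> {
    if let Some(rest) = line.strip_prefix("- ").or_else(|| line.strip_prefix("* ")) {
        return Some(rest.to_string());
    }
    // Ordered item: one or more ASCII digits followed by ". ".
    let digits = line.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits > 0 {
        return line[digits..].strip_prefix(". ").map(String::from);
    }
    None
}
```

With this, "## My heading" becomes the single message "My heading", and the four-item list becomes four separate messages.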

Sep 13 '22 20:09 mgeisler

> So, at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

My experience with this is that it becomes impossible to track changes after a little while. This is in some sense an important role of the structured files created by Gettext: they give you a way to unambiguously say these 17 paragraphs are out of date.

If you just have a stream of changes to chapter1.<lang>.md, then it suddenly becomes a management task of the translator to track where the chapter1.<other_lang>.md file is in relationship to the source. Yes, it's doable, but it would require that the translator would write something like  at the top of the file.

When text is added and removed from the source file, the translator will now have to apply these changes — perhaps a paragraph is added on Monday and revised on Tuesday and Wednesday. If the translator sees this Friday, then they have to manually notice that they can avoid translating the text from Monday and Tuesday and only translate the version from Wednesday.

The "buffer" in the messages.pot file helps here: the translator starts the workflow on Friday morning by extracting all strings to messages.pot. This file is then merged into other_lang.po. The translator now sees exactly what needs to be translated, and they see what is "fuzzy" because only minor changes have been made to the source paragraph.
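The merge step can be illustrated with a simplified sketch. It assumes the template is just a list of msgids and the existing catalog is a map from msgid to msgstr; the names merge and MergeResult are hypothetical. Unlike GNU msgmerge, it does no fuzzy matching, so a slightly edited paragraph simply shows up as untranslated.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct MergeResult {
    /// msgids whose existing translation can be reused unchanged.
    translated: Vec<String>,
    /// msgids that are new or changed and still need translating.
    untranslated: Vec<String>,
    /// msgids in the old catalog that no longer occur in the source.
    obsolete: Vec<String>,
}

/// Compare freshly extracted msgids against an existing catalog.
fn merge(template: &[&str], old_catalog: &HashMap<String, String>) -> MergeResult {
    let mut translated = Vec::new();
    let mut untranslated = Vec::new();
    for &msgid in template {
        // An empty msgstr means the entry was never translated.
        match old_catalog.get(msgid) {
            Some(msgstr) if !msgstr.is_empty() => translated.push(msgid.to_string()),
            _ => untranslated.push(msgid.to_string()),
        }
    }
    let mut obsolete: Vec<String> = old_catalog
        .keys()
        .filter(|k| !template.contains(&k.as_str()))
        .cloned()
        .collect();
    // HashMap iteration order is unspecified, so sort for stable output.
    obsolete.sort();
    MergeResult { translated, untranslated, obsolete }
}
```

In the Monday-to-Wednesday scenario, only Wednesday's version of the paragraph is in the template on Friday, so the intermediate revisions never reach the translator.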

Sep 13 '22 20:09 mgeisler

> I think that's what we both wanted: lines of text are kept together unless they are separated by \n\n+. Do you see something else? Could it perhaps be that you're on Windows? I wrote the code to split on \n only, but I don't see why it could not split on \r\n as well.

No, I am not on Windows, but I added additional line breaks after the sentences for better styling (my test text was a poem). This could be the reason.

> Now, this list example is perhaps a poor example: I've been wondering if it makes sense to parse the Markdown more carefully and emit individual msgids for each list item. Similarly, a heading like ## My heading could be put into the messages.pot file as simply My heading. That way the translators will have less markup to deal with (but also slightly less context).

Yes, maybe it's a good idea to have more "semantic" parsing of Markdown.

Sep 14 '22 05:09 aellwein

> No, I am not on Windows, but I added additional line breaks after the sentences for better styling (my test text was a poem). This could be the reason.

I see, was the poem perhaps indented or in a block quote? That is,

> foo
> bar

will be put into a single msgid. The same happens with the two quoted paragraphs in this example:

> foo
>
> bar

I think there could be a lot of benefit from parsing away such block-level markup and putting foo and bar into their own msgids. Similarly for code blocks, headings, and list items.

If we parse a list with 3 items into 3 msgids, then there's no way for a translator to add/remove list items. Right now, it seems like that's okay since it can help prevent translation mistakes.
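For the block quote case, a minimal sketch of "parsing away" the markup might look like this. It strips a single level of > markers and ignores nesting and lazy continuation lines; the function name is hypothetical.

```rust
/// Strip one level of block-quote markers and return the quoted paragraphs,
/// so that each one can become its own msgid.
fn quoted_paragraphs(block: &str) -> Vec<String> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in block.lines() {
        // Remove the leading '>' and at most one following space.
        let inner = line.strip_prefix('>').unwrap_or(line);
        let inner = inner.strip_prefix(' ').unwrap_or(inner);
        if inner.is_empty() {
            // A bare ">" line separates two quoted paragraphs.
            if !current.is_empty() {
                paragraphs.push(current.join("\n"));
                current.clear();
            }
        } else {
            current.push(inner);
        }
    }
    if !current.is_empty() {
        paragraphs.push(current.join("\n"));
    }
    paragraphs
}
```

The first example above stays a single message containing both lines, while the second example yields foo and bar as two separate messages.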

Sep 14 '22 07:09 mgeisler

Hi, I'm trying to do some translation with your code and https://github.com/rust-lang/mdBook/pull/1306. Here are the steps I took:

mdbook xgettext
msguniq messages.pot -o messages.pot
msginit -i messages.pot --local zh.po
mdbook gettext zh.po

Then I use mdbook from https://github.com/rust-lang/mdBook/pull/1306 to build, but get this error:

[ERROR] (mdbook::utils): Error: Couldn't open SUMMARY.md in "/home/trdthg/myproject/flutter_rust_bridge/book/src/zh" directory

https://github.com/rust-lang/mdBook/pull/1306 needs the translated book to have its own SUMMARY.md. So do I have to translate and copy it manually?

(Btw, cloning two extra copies of mdbook is a bad experience)

Sep 17 '22 12:09 trdthg

Hi @trdthg, thanks so much for testing this out!

> #1306 needs the translated book to have its own SUMMARY.md. So do I have to translate and copy it manually?

You're completely right that I missed the generation of the SUMMARY.md file. I've pushed a new version of the branch which will also translate this file.

> (Btw, cloning two extra copies of mdbook is a bad experience)

Yeah, I agree... Perhaps @Ruin0x11 could rebase the branch on top of the latest master so that I in turn can rebase my branch on top. I just looked at the history and I see that the commits are 1-2 years old... so this might be much more work than I had hoped.

Sep 18 '22 20:09 mgeisler

If I understand correctly, this adds better support for translator-focused tooling to my original code. Is that accurate? I don't mind rebasing again, but I want to make sure there are no blockers for integrating the original code like last time.

Sep 19 '22 00:09 Ruin0x11

> If I understand correctly, this adds better support for translator-focused tooling to my original code. Is that accurate?

Yes, that is precisely the idea. The new commands in this PR allow for a Gettext-based workflow for translations. The result is a tree of files which mirrors the original files, and this tree should be ready to be put under src/xx/ for the xx language.
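As a small illustration of the mirrored tree, the mapping from a source file to its translated counterpart could be sketched as follows. The function name and the exact layout are assumptions based on the src/xx/ convention described above.

```rust
use std::path::{Path, PathBuf};

/// Map a source file to its location in the translated tree for `lang`.
/// Returns `None` if `file` is not inside `src_dir`.
fn translated_path(src_dir: &Path, lang: &str, file: &Path) -> Option<PathBuf> {
    // Keep the path relative to the source root so the tree is mirrored.
    let relative = file.strip_prefix(src_dir).ok()?;
    Some(src_dir.join(lang).join(relative))
}
```

For example, src/individual/list.md would map to src/xx/individual/list.md for the xx language.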

> I want to make sure there are no blockers for integrating the original code like last time.

Just to be clear, I'm not a developer on the project — I'm just using mdbook myself for training materials and I would like to be able to translate this material to other languages.

Sep 19 '22 12:09 mgeisler

Okay, thanks for clarifying. I'm also not a major contributor to mdBook, but I shared the same need for multilingual support at one point. I'm happy to collaborate if there's some way of getting traction on these code changes.

Sep 19 '22 19:09 Ruin0x11

Hi all, I'll close this PR in favor of https://github.com/google/comprehensive-rust/pull/130. It's the same code there, but it's refactored to not require any changes to mdbook. Instead, I use a renderer (output format) to extract the strings and a preprocessor to do the translations.

You can reuse these tools in your own projects! Please let me know if you do so that we can figure out if we should publish them on crates.io.

Jan 09 '23 13:01 mgeisler

Just in case someone finds this much later: the tooling has been released as a set of mdbook plugins: https://github.com/google/mdbook-i18n-helpers.

May 19 '23 16:05 mgeisler

Hi all, the latest version of mdbook-i18n-helpers significantly improves on how the text is extracted by removing unnecessary Markdown syntax. Please try it out if you're still interested in translating your mdbook documentation!

Aug 23 '23 18:08 mgeisler
