Extract text more carefully in `mdbook-xgettext`

Open mgeisler opened this issue 2 years ago • 19 comments

Right now, we simply split the text on \n\n+, but this leads to a number of problems:

  • We split code blocks into different messages when there are one or more blank lines in the middle of the block.
  • We extract bullet point lists as a single message.

In general, it would be awesome if we could

  • Make the extracted messages independent of the precise formatting of the Markdown text. In particular, a hard-wrapped paragraph should be extracted without its line breaks.
  • Remove formatting such as # from headers and * from bullet points.
  • Extract code blocks as a single message.

So Markdown like

# This is a heading

A _little_
paragraph.

```rust,editable
fn main() {
    println!("Hello world!");
}
```

* First
* Second

should result in these messages

  • This is a heading (heading type is stripped)
  • A _little_ paragraph. (softwrapped lines are unfolded)
  • fn main() {\n    println!("Hello world!");\n} (info string is stripped)
  • First (bullet point extracted individually)
  • Second
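A minimal sketch of the extraction described above, assuming a hand-rolled line scanner rather than a real Markdown parser. The function name `extract_messages` is hypothetical, and only ATX headings, plain paragraphs, fenced code blocks, and single-line `*` bullets are handled:

```rust
/// Extract translatable messages from a small Markdown subset.
/// Illustration only: nested lists, multi-line list items, block quotes,
/// HTML and everything else a real parser handles are ignored here.
fn extract_messages(markdown: &str) -> Vec<String> {
    // Emit the pending paragraph (if any) as one unfolded message.
    fn flush(paragraph: &mut Vec<&str>, messages: &mut Vec<String>) {
        if !paragraph.is_empty() {
            messages.push(paragraph.join(" "));
            paragraph.clear();
        }
    }

    let mut messages = Vec::new();
    let mut paragraph: Vec<&str> = Vec::new();
    // `Some(lines)` while we are inside a fenced code block.
    let mut code: Option<Vec<&str>> = None;

    for line in markdown.lines() {
        if line.trim_start().starts_with("```") {
            flush(&mut paragraph, &mut messages);
            match code.take() {
                // Closing fence: the whole block becomes one message.
                Some(block) => messages.push(block.join("\n")),
                // Opening fence: the info string is dropped.
                None => code = Some(Vec::new()),
            }
            continue;
        }
        if let Some(block) = code.as_mut() {
            // Inside a code block blank lines and indentation are kept.
            block.push(line);
            continue;
        }
        let trimmed = line.trim();
        if trimmed.is_empty() {
            flush(&mut paragraph, &mut messages);
        } else if trimmed.starts_with('#') {
            flush(&mut paragraph, &mut messages);
            messages.push(trimmed.trim_start_matches('#').trim().to_string());
        } else if let Some(item) = trimmed.strip_prefix("* ") {
            flush(&mut paragraph, &mut messages);
            messages.push(item.trim().to_string());
        } else {
            // Soft-wrapped paragraph line: joined with spaces on flush.
            paragraph.push(trimmed);
        }
    }
    flush(&mut paragraph, &mut messages);
    messages
}
```

Calling `extract_messages` on the example Markdown above yields the five messages listed, with the code block kept as one message even if it contains blank lines.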

You could imagine doing something nice with links too: foo [bar](https://example.net) baz could be stored as foo [bar] baz. This might be a poor idea, though: it means that the translator cannot change the destination URL.

mgeisler avatar Feb 01 '23 12:02 mgeisler

When doing this, it's critically important that we run the same transformations on the existing po/*.po files. That way we can keep the work of the translators intact.

mgeisler avatar Feb 01 '23 13:02 mgeisler

Why not use pulldown_cmark, the Markdown parser used by mdbook, to extract the paragraphs and then reconstruct them?

moutikabdessabour avatar Feb 07 '23 18:02 moutikabdessabour

@moutikabdessabour we should definitely use pulldown_cmark for this!

mgeisler avatar Feb 08 '23 17:02 mgeisler

I'd like to work on this, if you don't mind assigning it to me. I can see how to replace the existing extract_paragraphs with a more sophisticated thing that emits a bunch of textual chunks.

> run the same transformations on the existing po/*.po files

Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.

djmitche avatar Feb 17 '23 21:02 djmitche

While working on the Korean translation I found that keeping Markdown syntax (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary. It's probably the same reason why @mgeisler thought trimming links would be a poor idea.

jooyunghan avatar Feb 17 '23 21:02 jooyunghan

That makes a lot of sense. I think we could adjust the chunk-extraction to collapse adjacent list-item chunks into a single chunk.

djmitche avatar Feb 17 '23 23:02 djmitche

> Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.

Yes, that was also roughly my idea. Basically that the new extraction functionality can be accessed from some temporary tool which will iterate over pairs of msgid and msgstr and apply the same extraction to those, producing yet more pairs. As you say, the idea is that a fully translated .po file should remain translated after running a new mdbook-xgettext followed by msgmerge.
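A rough sketch of what such a temporary tool could do for one entry, assuming the new extraction is available as a function from text to messages. The names `resplit_pair` and `split_bullets` are hypothetical, and `split_bullets` is only a stand-in extractor for illustration:

```rust
/// Stand-in extractor for the demo: one message per `* ` bullet line.
fn split_bullets(text: &str) -> Vec<String> {
    text.lines()
        .filter_map(|line| line.trim().strip_prefix("* "))
        .map(|item| item.trim().to_string())
        .collect()
}

/// Apply the same extractor to a msgid and its msgstr and pair up the
/// results. If the two sides do not split into the same (non-zero) number
/// of pieces, the original pair is returned unchanged so that a human can
/// look at it.
fn resplit_pair(
    msgid: &str,
    msgstr: &str,
    extract: impl Fn(&str) -> Vec<String>,
) -> Vec<(String, String)> {
    let ids = extract(msgid);
    let strs = extract(msgstr);
    if !ids.is_empty() && ids.len() == strs.len() {
        ids.into_iter().zip(strs).collect()
    } else {
        vec![(msgid.to_string(), msgstr.to_string())]
    }
}
```

Pairing by position only works when the translation kept the same structure as the source, which is why mismatches are passed through untouched instead of being guessed at.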

I'm thinking this should be done in smaller steps and that each step should be carried out on the .po files in lock step:

Perhaps we can start by teaching mdbook-xgettext to do proper Markdown parsing via pulldown_cmark first. Use new_cmark_parser to get a Parser and then probably into_offset_iter to get something which has the needed byte offsets.

Next, I imagine it would be easy to extract fenced code blocks as a unit, and probably also easy to strip away # from headings.

I've been dabbling a bit with this myself and I think the biggest trouble will be to parse all of the different Tag variants. So one thought I had was to only parse the simple stuff at first and bail out if you see anything else. Bailing out would mean falling back to the naive \n\n+ splitting of the file. Most pages in the course are very simple: a heading, some text, a code block. At least they used to be like that, but many of them now have "speaker notes", which is a trailing <details> ... </details> block at the end.
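The bail-out idea can be sketched like this, assuming a deliberately tiny "simple" extractor that only understands single-line headings and plain paragraphs. All function names are hypothetical, and the blank-line splitter only approximates a `\n\n+` regex split (it also trims stray newlines from each chunk):

```rust
/// Naive splitting on runs of blank lines.
fn split_on_blank_lines(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(|chunk| chunk.trim_matches('\n'))
        .filter(|chunk| !chunk.is_empty())
        .map(str::to_string)
        .collect()
}

/// Structured extraction for very simple pages. Returns `None` as soon as
/// something outside the supported subset shows up.
fn extract_simple(text: &str) -> Option<Vec<String>> {
    let mut messages = Vec::new();
    for chunk in split_on_blank_lines(text) {
        let first = chunk.trim_start();
        // HTML (such as <details>), fences and lists are not handled here.
        if first.starts_with('<') || first.starts_with("```") || first.starts_with("* ") {
            return None;
        }
        if first.starts_with('#') {
            // Only a heading on a line of its own is supported.
            if first.contains('\n') {
                return None;
            }
            messages.push(first.trim_start_matches('#').trim().to_string());
        } else {
            // Unfold a soft-wrapped paragraph into a single line.
            let unfolded: Vec<&str> = chunk.lines().map(str::trim).collect();
            messages.push(unfolded.join(" "));
        }
    }
    Some(messages)
}

/// Try the structured path first and fall back to naive splitting.
fn extract_or_fallback(text: &str) -> Vec<String> {
    extract_simple(text).unwrap_or_else(|| split_on_blank_lines(text))
}
```

With this shape, a page with speaker notes keeps producing the old messages until the structured path learns about `<details>`, so support can grow one construct at a time.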

> While working on the Korean translation I found that keeping Markdown syntax (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary.

I knew that the current system gives us that freedom, but I didn't know the freedom was used :smile: Can you tell us more about where you had to do this? My gut feeling is that we should try to improve the original English text in those cases.

mgeisler avatar Feb 19 '23 07:02 mgeisler

I got a start on this today in https://github.com/google/comprehensive-rust/pull/449.

I think this can get pretty close to producing the existing set of messages. This is probably a good place to start, and then update the .po files where they differ (number of newlines, maybe some funny business around <details>, etc.). Then a followup could wrap paragraphs, remove # from headers, break up bullet lists (if desired), and so on. The transformations on the .po files for these followups should be pretty straightforward.

djmitche avatar Feb 22 '23 23:02 djmitche

On experimenting a bit, I think we should leave lists as a unit for translation. The reason is, otherwise indentation is very hard to get right. For example, given

 * Always takes a single set of parameter types

we get

msgid: "Always takes a single set of parameter types"

If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be

msgstr: "Always"
"   takes"
"   a" ...

in order to keep the indentation correct. So, I will include lists in their entirety.

djmitche avatar Feb 24 '23 21:02 djmitche

Also, I don't think there's any automated way to re-break these messages. Some lists were broken into multiple messages by having \n\n between them, and some were not. I think the only way to go about this is manually editing the translation files :(

djmitche avatar Feb 24 '23 21:02 djmitche

> If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be
>
> msgstr: "Always"
> "   takes"
> "   a" ...
>
> in order to keep the indentation correct. So, I will include lists in their entirety.

Long-term, I would like to unwrap such paragraphs. So

* This is
  a single
  list item.

  Second paragraph
  in first item.

Becomes two messages in the .po file:

  • "This is a single list item."
  • "Second paragraph in first item."

Indentation and wrapping have been taken away. When translating the original text, we end up with

  • "* "
  • The translation of the first message
  • "\n\n "
  • The translation of the second message

This should work as long as there are no new \n characters in the two messages.
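The reassembly step could look roughly like this. It is a sketch under the assumption that continuation paragraphs are indented by the width of the list marker; `assemble_list_item` is a hypothetical name:

```rust
/// Rebuild one Markdown list item from its translated paragraphs.
/// Returns `None` if a translation contains a newline, because the
/// indentation of the continuation lines could then no longer be trusted.
fn assemble_list_item(marker: &str, paragraphs: &[&str]) -> Option<String> {
    if paragraphs.iter().any(|p| p.contains('\n')) {
        return None;
    }
    // Later paragraphs are indented to line up with the text after the marker.
    let indent = " ".repeat(marker.len());
    let separator = format!("\n\n{}", indent);
    Some(format!("{}{}", marker, paragraphs.join(separator.as_str())))
}
```

Rejecting embedded newlines keeps the check cheap. A fuller version might re-indent such translations instead of refusing them.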

The goal (for me) is to remove the possibility of errors in the translations, and also to make the translations robust against changes in the formatting.

I would like to hear from @jooyunghan, @jiyongp, @rastringer, @hugojacob, and @ronaldfw if this is a good goal?

mgeisler avatar Feb 25 '23 15:02 mgeisler

I think it's okay to not unwrap softly wrapped text. It is sometimes even useful, especially when translating a code fragment that has translatable comments. What is annoying with po is that it doesn't support multi-line strings. Ideally, I wish for the following. I'm not sure the PO file format supports it (but we could preprocess if not).

Markdown:

# This is a heading

A _little_
paragraph.

```rust,editable
fn main() { // translatable_comment_here
    println!("Hello world!");
}
```

* First
* Second

po file:

msgid "This is a heading"

msgid """A _little_
paragraph."""

msgid """fn main() { // translatable_comment_here
    println!("Hello world!");
}"""

msgid "First"

msgid "Second"

jiyongp avatar Feb 26 '23 06:02 jiyongp

> What is annoying with po is that it doesn't support multi-line strings.

The PO format uses C-style strings and C-style string concatenation. So

msgid ""
"f"
"o"
"o"

is a msgid of "foo", with no newlines. This means that there are a myriad of different ways to represent the same string in the PO file.
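To make the concatenation rule concrete, here is a small sketch that decodes the quoted lines of one PO string. `decode_po_string` is a hypothetical helper, it expects only the quoted parts (without the `msgid` keyword), and it handles just the common escapes:

```rust
/// Decode the quoted lines of one PO string into the text they represent.
/// Adjacent literals are concatenated with nothing in between, as in C.
fn decode_po_string(lines: &[&str]) -> String {
    let mut out = String::new();
    for line in lines {
        // Keep only what sits between the outer double quotes.
        let inner = match line.trim().strip_prefix('"').and_then(|s| s.strip_suffix('"')) {
            Some(inner) => inner,
            None => continue, // not a quoted literal, skip it
        };
        let mut chars = inner.chars();
        while let Some(c) = chars.next() {
            if c != '\\' {
                out.push(c);
                continue;
            }
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                // Covers \" and \\; other escapes just lose the backslash.
                Some(other) => out.push(other),
                None => out.push('\\'),
            }
        }
    }
    out
}
```

Both the four-line form above and a single `"foo"` literal decode to the same three characters, which is the "myriad of representations" point: only an explicit `\n` escape puts a newline into the message.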

When using msgmerge to update a PO file, it will wrap strings at 80 columns by default, but it will also use embedded \n as good places to wrap.

I don't understand how having support for newlines in the strings in the PO file helps you here?

mgeisler avatar Feb 27 '23 15:02 mgeisler

I know that. But there are a few problems here:

  1. Having to wrap each line with "..." is annoying when you edit the po file with an ordinary editor (e.g. vim). Multi-line strings are much easier to deal with.
  2. poedit doesn't support this. It forcibly adds \n to every line you make. For example:

A translated text entered in poedit

f
o
o

becomes

msgid ""
"f\n"
"o\n"
"o"

jiyongp avatar Feb 28 '23 00:02 jiyongp

IMO, the problem with working on the PO file directly is that we need to handle a stack of two encodings: the PO file's C-style string literals (with escaping) on top of the Markdown text. I think that's why @jiyongp wished the PO file supported raw text.

My workflow now is:

  • Use poedit as the main editor for the "Markdown text" (no need to think about the PO file format).
  • Since I don't want poedit to do extra work, I unchecked both "line wrapping" and "preserve formatting".

jooyunghan avatar Feb 28 '23 02:02 jooyunghan

> Having to wrap each line with "..." is annoying when you edit the po file with an ordinary editor (e.g. vim). Multi-line strings are much easier to deal with.

Thanks, I see what you mean now!

For that use case, I would suggest writing a tiny tool which transforms a .po file into a .yaml file or a similar format which can have multi-line strings. There are many services that can convert PO files to some sort of YAML, but I think you'll want something which simply outputs the msgid and msgstr fields in a long list. For fun, I wrote a po2yaml tool. This only converts one way — you'll want to also convert back to a PO file after editing the YAML file. The output looks like this:

- msgid: '# Running the Course'
  msgstr: '# 강의 진행 방식'
- msgid: '> This page is for the course instructor.'
  msgstr: '> 강사를 위한 안내 페이지입니다.'
- msgid: |-
    Here is a bit of background information about how we've been running the course
    internally at Google.
  msgstr: 다음은 구글 내부에서 이 과정을 어떤식으로 운영해왔는지에 대한 배경 정보입니다.
- msgid: 'To run the course, you need to:'
  msgstr: '강의를 실행하기 위한 준비:'
- msgid: |-
    1. Make yourself familiar with the course material. We've included speaker notes
       on some of the pages to help highlight the key points (please help us by
       contributing more speaker notes!). You should make sure to open the speaker
       notes in a popup (click the link with a little arrow next to "Speaker
       Notes"). This way you have a clean screen to present to the class.
  msgstr: 1. 강의 자료를 숙지합니다. 주요 요점을 강조하기 위해 일부 페이지에 강의 참조노트를 포함하였습니다. (추가적인 노트를 작성하여 제공해 주시면 감사하겠습니다.) 강의 참조 노트의 링크를 누르면 강의노트가 별도의 팝업으로 분리가 되며, 메인 화면에서는 사라집니다.

As you can see, multi-line inputs end up as multi-line literal blocks in the YAML file — ready to be edited using your favorite tool :smile:

If you think this is useful, then we can probably put it somewhere.

mgeisler avatar Feb 28 '23 18:02 mgeisler

@mgeisler Yes, that looks great. I'd use it.

One question though: which file will be the source of truth? yaml, or po?

jiyongp avatar Mar 02 '23 05:03 jiyongp

> @mgeisler Yes, that looks great. I'd use it.
>
> One question though: which file will be the source of truth? yaml, or po?

I was thinking that you would generate the YAML file locally whenever you want and then export back to .po via a save hook in your editor. I was not thinking that it would be used by others, but if you find such a format useful, then go for it :smile:

We would need the YAML-to-PO conversion as well, but that should be trivial — we need the fuzzy markers as well, but the source lines (the filenames and line numbers) can be skipped since they come from the messages.pot file anyway.
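The writing half of that conversion could be sketched as follows, assuming the YAML side has already been parsed into msgid, msgstr and a fuzzy flag (the parsing itself is left out). The function names are hypothetical, and unlike msgmerge this sketch does not wrap at 80 columns, it only breaks after embedded newlines:

```rust
/// Escape a string for use inside a PO string literal.
fn escape_po(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\t' => out.push_str("\\t"),
            other => out.push(other),
        }
    }
    out
}

/// Format one field such as `msgid` or `msgstr`. Text with embedded
/// newlines uses the conventional empty first literal followed by one
/// literal per line.
fn format_field(keyword: &str, text: &str) -> String {
    if !text.contains('\n') {
        return format!("{} \"{}\"\n", keyword, escape_po(text));
    }
    let mut out = format!("{} \"\"\n", keyword);
    for piece in text.split_inclusive('\n') {
        out.push_str(&format!("\"{}\"\n", escape_po(piece)));
    }
    out
}

/// Format a complete entry, including the fuzzy marker when set.
/// Source references (file names and line numbers) are skipped on purpose.
fn format_po_entry(msgid: &str, msgstr: &str, fuzzy: bool) -> String {
    let mut out = String::new();
    if fuzzy {
        out.push_str("#, fuzzy\n");
    }
    out.push_str(&format_field("msgid", msgid));
    out.push_str(&format_field("msgstr", msgstr));
    out
}
```

Running msgmerge against messages.pot afterwards would restore the source references and the usual wrapping.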

mgeisler avatar Mar 04 '23 16:03 mgeisler

ack!

jiyongp avatar Mar 06 '23 06:03 jiyongp