Markdown files support [$250]

Open maicol07 opened this issue 5 years ago • 27 comments

Describe the solution you'd like Markdown files support (maybe similar to Crowdin system)

maicol07 avatar Oct 12 '19 15:10 maicol07

We currently do not support translating any documents, only formats designed for localization.

Related to #2592

nijel avatar Oct 13 '19 19:10 nijel

@nijel do you think this will be implemented?

maicol07 avatar Oct 13 '19 19:10 maicol07

I think it will be implemented at some point. It should not be hard to implement (we already do something similar for the appstore metadata). Right now it's just not a priority for me, but this can change if somebody comes with funding for this :-).

nijel avatar Oct 14 '19 11:10 nijel

@nijel What sort of funding are we talking about here? Asking for a friend.

RMStoica-zivver avatar Nov 25 '19 15:11 RMStoica-zivver

@RMStoica-zivver You can use https://www.bountysource.com/issues/81891384-markdown-files-support to put funds on this issue to motivate contributors.

nijel avatar Nov 25 '19 22:11 nijel

I have added a bounty on this issue - it's not exactly clear to me what the integration should look like but I trust @nijel to steer the idea in the right direction.

Our own requirement is to translate documentation, represented by a set of Markdown files. Since these Markdown files will be stored in the same git repository as our UI i18n files (and the rest of our code), it would be ideal if the Markdown files could be added as an extra component in the Weblate project.

wetneb avatar Mar 26 '20 15:03 wetneb

Depends on https://github.com/translate/translate/issues/3956 which depends on https://github.com/miyuchina/mistletoe/pull/162

nijel avatar Mar 30 '20 10:03 nijel

I think it can be easily supported with some simple script, converting Markdown files to JSON files.
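A minimal sketch of what such a script might look like, assuming a naive split on blank lines and made-up sequential keys (the function name `markdown_to_json` and the `p0001` key scheme are illustrative only, not anything Weblate provides). It does not understand fenced code blocks, lists, or front matter:

```python
import json
import re


def markdown_to_json(text: str) -> str:
    """Split Markdown on blank lines and emit a JSON object of numbered paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Zero-padded keys (hypothetical scheme) keep paragraphs in document order.
    units = {f"p{i:04d}": p for i, p in enumerate(paragraphs, start=1)}
    return json.dumps(units, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(markdown_to_json("# Hello world\n\nBye all\n"))
```

The resulting JSON file could then be added as an ordinary JSON component. Note that purely positional keys shift whenever a paragraph is inserted or removed in the source.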

guoyunhe avatar Apr 28 '20 10:04 guoyunhe

If you want to go the conversion route, I recommend using po4a, a 20-year-old project for doing just that. It recently got some key improvements in v0.58. It's used for things like f-droid.org and Fedora documentation.

eighthave avatar Apr 28 '20 11:04 eighthave

@eighthave good to know. thanks!

guoyunhe avatar Apr 28 '20 11:04 guoyunhe

About the idea of using po4a here, I think that it makes perfect sense (disclaimer: I'm one of the authors of po4a).

We already have an existing markdown parser, but it's ... not rock stable and changing it may be more complex than rewriting a new parser. What would remain is the surrounding infrastructure of po4a, which makes the conversion between documentation formats and PO files easier, and the tests.

My plan to improve the support of Markdown in po4a is to simplify the existing parser (its code is convoluted), and then improve its robustness using for example the tests from https://github.com/bobtfish/text-markdown/tree/master/t

Markdown is not very complex compared to other formats we handle pretty well in po4a (e.g., groff for man pages, or XML plus the DocBook and HTML variants). For both formats, we use internal parsers with no dependency on any external tool or library. This is because the kind of parsing that we are doing is specific, so we felt it was easier this way.

The groff parser is interesting in the sense that it really normalizes the input. There are maybe 6 ways to specify the inline formatting (bold, italic), and po4a converts them all to a single form to make life easier for translators. I'm not sure that this will be required for the markdown parser, but that's something to consider. The XML parser is interesting because it is difficult to write a line-by-line parser for XML, just as it is for markdown. So the solution built into the XML parser could be useful for reworking the markdown parser: instead of the line-by-line parser that we currently have, we could go for a block-by-block approach. That would help with supporting the bits that are currently not supported.
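To illustrate the difference, here is a rough Python sketch of a block-by-block splitter. It is not po4a's code (po4a is written in Perl) and only assumes two rules: blank lines separate blocks, and a fenced code block stays in one piece even when it contains blank lines. The names `split_blocks` and `FENCE` are made up for the example:

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def split_blocks(text: str) -> list[str]:
    """Group lines into blocks; blank lines inside a fenced block do not split it."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # an opening or closing fence toggles the state
            current.append(line)
            continue
        if not in_fence and not line.strip():
            # A blank line outside a fence ends the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
            continue
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


if __name__ == "__main__":
    sample = (
        "Intro paragraph\nstill intro\n\n"
        + FENCE + "\ncode line\n\nmore code\n" + FENCE
        + "\n\n- item one\n- item two\n"
    )
    for block in split_blocks(sample):
        print(repr(block))
```

A line-by-line parser would have to carry this kind of state across every construct; working on whole blocks lets each block (paragraph, list, code) be classified and then either offered to translators or passed through untouched.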

Edit: we also have some format parsers that are using external tools in po4a. The POD parser is using a dedicated Perl library while the SGML parser is using the onsgml external parser. On another front, I am considering whether asciidoctor could be used as an external parser for the AsciiDoc format. If someone knows a parser for markdown that works a bit like a SAX parser, that may be an interesting starting point, maybe.

I'm willing to help any volunteer, but my personal schedule does not allow me to address this issue alone anytime soon.

Oh. And po4a is written in Perl. Sorry about that...

mquinson avatar Apr 29 '20 10:04 mquinson

What I mean is that there are a lot of important ideas in po4a that can be used in a Python/AST implementation:

  • the --keep option, e.g. a percentage translated that must be met, or the document reverts to the source language
  • automatic metadata like "markdown-text"
  • removing pure syntax strings from the translator's view
  • custom YAML Front Matter handling

But in a broader sense, I think a po4a mode might make sense for Weblate to handle formats like asciidoc, groff/man, etc. I wasn't thinking to use po4a directly in Weblate to handle Markdown, though that might be a quick fix for this. I think that having access to the Markdown AST will enable so many really useful possibilities, it will be worth the work.
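As a rough illustration of the --keep idea from the list above, here is a Python sketch. It assumes the document has already been split into source units and that translations are a plain dict; `render_with_keep` is a hypothetical helper, and the default of 80 is only meant to mirror po4a's documented default threshold. As far as I know, po4a itself skips writing the translated file when the threshold is missed; falling back to the source text, as done here, is one way a site could get the "reverts to the source language" behaviour:

```python
def render_with_keep(
    source_units: list[str],
    translations: dict[str, str],
    keep: float = 80.0,
) -> list[str]:
    """Return the translated units, or the untouched source if too little is translated."""
    if not source_units:
        return []
    translated = sum(1 for unit in source_units if translations.get(unit))
    percent = 100.0 * translated / len(source_units)
    if percent < keep:
        # Below the threshold: serve the whole document in the source language.
        return list(source_units)
    # Above the threshold: untranslated units fall back to their source text.
    return [translations.get(unit) or unit for unit in source_units]


if __name__ == "__main__":
    units = ["Hello", "Bye", "Thanks", "Welcome"]
    fr = {"Hello": "Bonjour", "Bye": "Au revoir", "Thanks": "Merci"}
    print(render_with_keep(units, fr))           # 75% translated, below 80
    print(render_with_keep(units, fr, keep=50))  # 75% translated, above 50
```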

eighthave avatar Apr 29 '20 11:04 eighthave

If I may: from the translator's point of view, block-by-block makes a lot of sense, as one block will be one paragraph, or one list, etc. I would also suggest taking a look at pandoc.

RMStoica-zivver avatar Apr 29 '20 12:04 RMStoica-zivver

Considering that all "inline HTML" is valid markdown, I would suggest approaching markdown files by converting md -> html and then simply using translate-toolkit's HTML support, perhaps with another small layer to handle front matter. That should make it a little more straightforward to implement complete markdown support.

I do apologize in case I've missed something about the problem. I only came across this project on Bountysource a few hours ago.

akumar-xyz avatar May 04 '20 15:05 akumar-xyz

The Markdown AST parser libraries will understand the HTML components, and then let us work directly with the AST.

eighthave avatar May 05 '20 13:05 eighthave

I'm adding 1 Monero (XMR value at the moment: $346 - updated value) to the bounty to be awarded to the person who will resolve this issue.

I would have used Bountysource, but the related bounty can only be funded through PayPal. To send the bounty I will need a Monero address (or, if preferred, an address for another cryptocurrency, like BTC or ETH).

erciccione avatar Feb 25 '21 14:02 erciccione

Reminder that there are two bounties on this issue: $240 + 1XMR. This feature is very much needed.

erciccione avatar May 23 '21 12:05 erciccione

Hi, how can I do this now? With po4a? What are the steps? Say I have hello.md with "Hello world \n Bye all" in it, and would like it to be translatable, for example.

Svetlana-T avatar May 28 '21 04:05 Svetlana-T

> Hi, how can I do this now? With po4a? What are the steps? Say I have hello.md with "Hello world \n Bye all" in it, and would like it to be translatable, for example.

Create a po4a.conf file (name doesn't matter) in a po/ subdirectory with the content

[po4a_langs] fr es it de
[po4a_paths] po/mysite.pot $lang:po/mysite.$lang.po

[options] opt:"--addendum-charset=UTF-8" opt:"--localized-charset=UTF-8" opt:"--master-charset=UTF-8" opt:"--master-language=en_US" opt:"--msgmerge-opt='--no-wrap'" opt:"--porefs=file" opt:"--wrap-po=newlines"

[po4a_alias:markdown] text opt:"--option markdown" opt:"--option yfm_keys=title" opt:"--addendum-charset=UTF-8" opt:"--localized-charset=UTF-8" opt:"--master-charset=UTF-8" opt:"--keep=0"

[type: markdown] content/hello.md $lang:content/$lang/hello.md
[type: markdown] content/goodbye.md $lang:content/$lang/goodbye.md

Then run

po4a po/po4a.conf

If using git, you can add these sorts of rules into .gitignore:

# no need to translate the source language, but po4a gens this file
po/mysite.en.po

# po4a auto-generated markdown files from translations
content/[a-z][a-z]/*.md
content/[a-z][a-z][a-z]/*.md
content/[a-z][a-z][a-z]_[A-Z]*/*.md
content/[a-z][a-z]_[A-Z]*/*.md

Now there is also this tool from KDE: https://invent.kde.org/websites/hugo-i18n

ilmari-lauhakangas avatar May 28 '21 05:05 ilmari-lauhakangas

Here are some examples of sites doing this with po4a: https://gitlab.com/fdroid/fdroid-website/ https://github.com/fsfe/reuse-docs/pull/61

eighthave avatar May 28 '21 06:05 eighthave

I have not tried it but I suspect the main issue with this workflow is that translators get to translate parts of markdown files out of context, no?

In OpenRefine we are sadly going to go for Crowdin (for now), because it seems to be the only solution which offers a real markdown editor where you can see the entire file being translated while still working on individual parts.

If people are interested in adding a similar Markdown support in Weblate, I could imagine finding some funding for it (the existing bounties will not get us very far I am afraid). Maybe we could pool resources with other projects interested in the feature?

wetneb avatar May 28 '21 09:05 wetneb

> I have not tried it but I suspect the main issue with this workflow is that translators get to translate parts of markdown files out of context, no?
>
> In OpenRefine we are sadly going to go for Crowdin (for now), because it seems to be the only solution which offers a real markdown editor where you can see the entire file being translated while still working on individual parts.

I wouldn't extend the scope of this issue to include such a nice-to-have feature.

ilmari-lauhakangas avatar May 28 '21 09:05 ilmari-lauhakangas

Is this issue only about getting the strings out of Markdown and translating them? I would suggest supporting something similar to Crowdin's documentation localization offering.

yarons avatar Nov 28 '21 16:11 yarons

Last I looked, Crowdin's Markdown support was limited but better than nothing. The best way would be to actually use the AST (Abstract Syntax Tree). That means Markdown becomes structured data like JSON, YAML, XML, etc.

eighthave avatar Nov 29 '21 13:11 eighthave

I'm talking about the fact that you can see the end result in a preview pane while translating. Mozilla's Pontoon also offers such a capability.

yarons avatar Nov 29 '21 17:11 yarons

We're translating markdown articles via Weblate. Initially, we wanted to translate by inserting plain text into Weblate, but Weblate was pretty bad at handling insertions and deletions of paragraphs in the original text.

I've looked into po4a and other ways to convert the text into formats that would allow us to easily translate and update the text, but haven't found anything that would be easy to use and wouldn't generate lots of overhead.

So I've written a simple Golang package that splits text into paragraphs, compares to the previous version of the text (if there is one), and produces JSON of a map from keys to paragraphs in a way that keeps the paragraphs in the right order, doesn't change the keys if the text wasn't significantly changed and handles insertions and deletions in a way that avoids key collisions. It should be pretty easy to write something similar and add to how Weblate handles plain text; but if anyone's interested, I can add the documentation, examples, etc. to the tool I've written.
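For readers who want the gist without the Go package, here is a rough Python sketch of the same idea. It is not the tool described above: the `p0001` key format, the 0.6 similarity threshold, and the name `assign_keys` are all assumptions made for the example. It diffs the old and new paragraph lists with `difflib`, reuses a key when a paragraph is unchanged or only lightly edited, and hands out fresh keys above the highest old one so that keys of deleted paragraphs are never reused:

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Character-level similarity between two paragraphs, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()


def assign_keys(old: dict[str, str], new_paragraphs: list[str],
                threshold: float = 0.6) -> dict[str, str]:
    """Map keys to the new paragraphs, in order, keeping old keys where possible.

    `old` maps keys of the (assumed) form "p0001" to the previous paragraphs.
    """
    old_keys, old_texts = list(old), list(old.values())
    next_id = max((int(k[1:]) for k in old_keys), default=0) + 1
    result: dict[str, str] = {}
    matcher = difflib.SequenceMatcher(a=old_texts, b=new_paragraphs, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result[old_keys[i]] = new_paragraphs[j]
            continue
        # replace / insert / delete: match each new paragraph to the most
        # similar unused old paragraph in this changed region, if close enough.
        unused = list(range(i1, i2))
        for j in range(j1, j2):
            text = new_paragraphs[j]
            best = max(unused, key=lambda i: similarity(old_texts[i], text), default=None)
            if best is not None and similarity(old_texts[best], text) >= threshold:
                unused.remove(best)
                key = old_keys[best]
            else:
                key = f"p{next_id:04d}"  # fresh key, never collides with old ones
                next_id += 1
            result[key] = text
    return result


if __name__ == "__main__":
    old = {
        "p0001": "Weblate is a translation tool.",
        "p0002": "It supports many formats.",
        "p0003": "Thanks for reading.",
    }
    new = [
        "Weblate is a translation tool.",
        "A brand new paragraph about Markdown.",
        "It supports many file formats.",
        "Thanks for reading.",
    ]
    for key, text in assign_keys(old, new).items():
        print(key, text)
```

With keys that survive small edits, existing translations stay attached to their paragraphs and only the inserted paragraph shows up as new.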

Mihonarium avatar May 27 '22 14:05 Mihonarium

Handling of plain text files will work better since 4.13, see https://github.com/WeblateOrg/weblate/pull/7585

nijel avatar Jun 06 '22 13:06 nijel

Thank you for your report; the issue you have reported has just been fixed.

  • In case you see a problem with the fix, please comment on this issue.
  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

github-actions[bot] avatar Aug 01 '23 18:08 github-actions[bot]