gettext-extractor Feat: Optionally extract raw html instead of parse5 serialization

Open vbraun opened this issue 4 years ago • 2 comments

This adds a rawHtml option to extract the actual source HTML instead of the parse5-roundtripped version. I'm not sure if it's a good idea, but I'm trying to replace a gettext extractor that does just this.

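const { GettextExtractor, HtmlExtractors } = require('gettext-extractor');

const extractor = new GettextExtractor();

extractor
    .createHtmlParser([
        HtmlExtractors.elementContent('translate, [translate]', {
            attributes: {
                context: 'translate-context',
                comment: 'translate-comment',
            },
            // Option proposed by this change: keep the template source as written
            rawHtml: true,
        }),
    ])
    .parseFilesGlob('./src/**/*.html');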

Documentation and lint still need fixing, but maybe it's not a good idea to begin with? ;-)

Aug 03 '20 18:08 vbraun

Can you go into a bit more detail on the problem your change addresses? Is this just about HTML entities (similar to #36) or do you have other issues with the extracted contents?

Aug 06 '20 14:08 lukasgeiter

Yes, it's about HTML entities; that is, roundtripping through parse5 loses information. In particular, the angular-gettext-cli extractor doesn't do that and … extracts as literal. As a first step towards replacing it, I wanted to reproduce the extracted po file in an existing project, and found that I was unable to do so for various HTML entities.

Now one might argue that this is the correct way of doing things since the DOM does that as well, and you are going to match el.innerText / el.innerHTML anyway. And I'm open to editing my po files to move HTML entities around. Still, it seems that for full flexibility one should at least be able to have po files where the msgid is any one of:

innerHTML
innerText
actual source of the template
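
To make the three candidates concrete, here is a minimal TypeScript sketch. It does not use parse5 or a real DOM: decodeEntities and escapeText are hypothetical helpers with a deliberately tiny entity table, standing in for what a parser and a serializer roughly do to text content. The sample string is made up for illustration.

```typescript
// Hypothetical, deliberately tiny entity table; a real parser knows every named entity.
const ENTITIES: Record<string, string> = {
    '&amp;': '&',
    '&lt;': '<',
    '&gt;': '>',
    '&hellip;': '\u2026',
};

// Roughly what parsing does to text: entities become the characters they name.
function decodeEntities(source: string): string {
    return source.replace(/&[a-z]+;/g, (entity) => ENTITIES[entity] ?? entity);
}

// Roughly what serializing does to text: only a few characters are escaped again.
function escapeText(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

const templateSource = 'Loading&hellip; &amp; more';   // actual source of the template
const decodedText = decodeEntities(templateSource);    // innerText-like
const reserialized = escapeText(decodedText);          // innerHTML-like

console.log(templateSource);
console.log(decodedText);
console.log(reserialized);
```

The point of the sketch is that the third string cannot be turned back into the first: once &hellip; has been decoded, the serializer has no reason to write it as an entity again, so a msgid taken from the roundtripped HTML differs from one taken from the source.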

Slightly related question: getElementContent has some special handling for <, >, and & but not for the non-breaking space (&nbsp;), even though that's also in the spec: https://html.spec.whatwg.org/multipage/parsing.html#escapingString
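
For reference, this is how I read the text-content part of that algorithm, as a small TypeScript sketch. escapeTextPerSpec is a hypothetical name, not a function from gettext-extractor or parse5, and attribute mode is left out on purpose.

```typescript
// Sketch of the spec's "escaping a string" steps for text content only
// (attribute mode is omitted). The function name is made up for this example.
function escapeTextPerSpec(input: string): string {
    return input
        .replace(/&/g, '&amp;')        // ampersands first, so later output is not escaped twice
        .replace(/\u00A0/g, '&nbsp;')  // U+00A0 NO-BREAK SPACE
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

const sample = 'a\u00A0<b> & c';
console.log(escapeTextPerSpec(sample));
```

With the non-breaking space step included, a template that contains &nbsp; serializes back with the entity rather than with a bare U+00A0 character.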

Aug 06 '20 15:08 vbraun
