gettext-extractor
Feat: Optionally extract raw html instead of parse5 serialization
This adds a rawHtml option to extract the actual source HTML instead of the parse5-roundtripped version. I'm not sure if it's a good idea, but I'm trying to replace a gettext extractor that does just this.
extractor
    .createHtmlParser([
        HtmlExtractors.elementContent('translate, [translate]', {
            attributes: {
                context: 'translate-context',
                comment: 'translate-comment',
            },
            rawHtml: true,
        }),
    ])
    .parseFilesGlob('./src/**/*.html');
Documentation and lint need fixing, but maybe it's not a good idea to start with? ;-)
Can you go a bit more into detail on the problem your change addresses? Is this just about HTML entities (similar to #36) or do you have other issues with the extracted contents?
Yes, it's about HTML entities: roundtripping through parse5 loses information. In particular, the angular-gettext-cli extractor doesn't do that and … extracts as literal. As a first step towards replacing it, I wanted to reproduce the extracted po file of an existing project, and found that I was unable to do so for various HTML entities.
Now one might argue that this is the correct way of doing things, since the DOM does that as well and you are going to match el.innerText / el.innerHTML anyway. And I'm open to editing my po files to move HTML entities around. Still, it seems that for full flexibility one should at least be able to have po files where the msgid is either:
- innerHTML
- innerText
- actual source of the template
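To make the three options concrete, here is a rough, self-contained sketch of what each msgid could look like for one text-only template fragment. It does not use parse5 or gettext-extractor; the helper names (decodeEntities, escapeText, msgidCandidates) and the tiny entity table are made up for illustration, and a real parser handles far more cases.

```typescript
// Tiny, deliberately incomplete entity table (assumption for this sketch);
// a real HTML parser knows a much longer list of named references.
const ENTITIES = new Map<string, string>([
    ['amp', '&'],
    ['lt', '<'],
    ['gt', '>'],
    ['nbsp', '\u00a0'],
    ['hellip', '\u2026'],
]);

// Decode named entities, roughly what a parser does when it builds text nodes.
// Unknown names are left untouched.
function decodeEntities(source: string): string {
    return source.replace(/&([a-z]+);/g, (match: string, name: string) =>
        ENTITIES.get(name) ?? match);
}

// Re-escape only &, < and >, as a stand-in for serializing text content.
function escapeText(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

interface MsgidCandidates {
    source: string;    // literal template source
    innerText: string; // entities decoded
    innerHTML: string; // decoded, then re-serialized
}

function msgidCandidates(source: string): MsgidCandidates {
    const innerText = decodeEntities(source);
    return { source, innerText, innerHTML: escapeText(innerText) };
}

const sample = msgidCandidates('Fish &amp; chips&hellip;');
// sample.source    is 'Fish &amp; chips&hellip;'
// sample.innerText is 'Fish & chips\u2026'
// sample.innerHTML is 'Fish &amp; chips\u2026'
```

In this toy model the original &hellip; only survives in the source variant, which is the information a roundtrip cannot give back.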
Slightly related question: getElementContent has some special handling for <, >, and & but not for the non-breaking space (&nbsp;), even though that's also in the spec: https://html.spec.whatwg.org/multipage/parsing.html#escapingString
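For reference, a minimal sketch of the text-content (non-attribute) case of the escaping algorithm linked above, which also replaces U+00A0. The function name escapeHtmlText is hypothetical and this is not the code in getElementContent; attribute mode is left out on purpose.

```typescript
// Sketch of text-mode escaping: &, U+00A0, < and > are replaced.
// The & replacement has to run first so that the entities produced by
// the later steps are not escaped a second time.
function escapeHtmlText(text: string): string {
    return text
        .replace(/&/g, '&amp;')
        .replace(/\u00a0/g, '&nbsp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// A string containing a literal no-break space between "a" and "<b>".
const escaped = escapeHtmlText('a\u00a0<b> & c');
// escaped is 'a&nbsp;&lt;b&gt; &amp; c'
```

With this in place a no-break space in a text node would come back as &nbsp; in the extracted string instead of as an invisible literal character.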