Feature request: Provide page's language in ParseResult
For further processing of a website, it can be important to know the language of the page. For example, to display justified text in the browser, the language must be specified via <html lang="de">.
Unfortunately, mercury currently does not forward the language information from the input document to the ParseResult.
It would be great if the ParseResult could be extended with a language property containing the value of the <html> element's lang attribute.
Hi @svenwiegand, thanks for your feedback.
A quick and easy solution you may want to try is to add a custom type extension to your parse call.
Example via CLI
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend lang="html|lang"
Result
{
  "title": "Mercury Goes Open Source! – Postlight – Digital Product Studio",
  "author": "Adam Pash",
  "date_published": "2019-02-06T14:36:45.000Z",
  "dek": null,
  "lead_image_url": "https://postlight.com/wp-content/uploads/2019/02/mercury-open-source-social-card-e1550670446269.png",
  "content": "...content",
  "next_page_url": null,
  "url": "https://postlight.com/trackchanges/mercury-goes-open-source",
  "domain": "postlight.com",
  "excerpt": "It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser. Written in JavaScript and running on both Node and in the ...",
  "word_count": 436,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1,
  "lang": "en-US"
}
See https://github.com/postlight/mercury-parser/tree/master/src/extractors/custom#custom-types for more information on adding additional fields to the response.
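For use from Node rather than the CLI, the same extension can probably be expressed as a plain object. The sketch below is only an illustration: parseExtendSpec is a hypothetical helper (not part of mercury-parser) that turns a CLI-style spec such as lang=html|lang into the selector shape described in the custom types documentation. Passing the result to Mercury.parse through an extend option is an assumption and is left commented out.

```javascript
// Hypothetical helper, not part of mercury-parser: converts a CLI-style
// spec ("name=selector|attribute") into a custom-type object.
function parseExtendSpec(spec) {
  const eq = spec.indexOf('=');
  if (eq < 1) {
    throw new Error(`Expected name=selector[|attribute], got: ${spec}`);
  }
  const name = spec.slice(0, eq);
  // "html|lang" -> selector "html", attribute "lang"
  const [selector, attr] = spec.slice(eq + 1).split('|');
  // With an attribute, the selector entry is a [selector, attribute] pair;
  // without one, it is just the selector string.
  return { [name]: { selectors: [attr ? [selector, attr] : selector] } };
}

const extend = parseExtendSpec('lang=html|lang');
console.log(JSON.stringify(extend));
// prints {"lang":{"selectors":[["html","lang"]]}}

// Assumed usage (not executed here; requires the package and network access):
// const Mercury = require('@postlight/mercury-parser');
// Mercury.parse(url, { extend }).then(result => console.log(result.lang));
```

Since the extension is evaluated as part of the same parse call, this should not require fetching or parsing the page a second time.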
Is there a way to do this without writing a custom extractor (or using the CLI)? It would feel a little silly to have to re-parse the whole page just to get the language ...