Feature request: Provide page's language in ParseResult

Open svenwiegand opened this issue 4 years ago • 2 comments

For further processing of websites, it may be important to know the language of the page. For example, to display justified text in the browser, specifying the language via <html lang="de"> is necessary.

Unfortunately, Mercury does not currently forward the language information from the input document to the ParseResult.

It would be great if the ParseResult could be extended with a language property containing the value of the <html> element's lang attribute.
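To make the requested value concrete, here is a minimal sketch of reading the lang attribute from a raw HTML string. The helper name extractHtmlLang is hypothetical and not part of Mercury; it uses a regular expression rather than a real HTML parser, so it is only an approximation for well-formed pages.

```javascript
// Hypothetical helper, not part of Mercury: read the lang attribute of the
// opening <html> tag from a raw HTML string. Returns null when absent.
function extractHtmlLang(html) {
  // Find the opening <html ...> tag (case-insensitive).
  const tag = /<html\b[^>]*>/i.exec(html);
  if (!tag) return null;

  // Match lang="...", lang='...' or an unquoted lang=value. Requiring
  // whitespace before "lang" skips attributes like xml:lang or data-lang.
  const attr = /\slang\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i.exec(tag[0]);
  if (!attr) return null;

  const value = (attr[1] ?? attr[2] ?? attr[3]).trim();
  return value === '' ? null : value;
}

// Usage: the value a "language" property on the result would carry.
console.log(extractHtmlLang('<!DOCTYPE html><html lang="de"><head></head></html>'));
```

A regex keeps the sketch dependency-free; a DOM-based lookup would be more robust against unusual markup.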

svenwiegand, Aug 31 '19 14:08

Hi @svenwiegand, thanks for your feedback.

A quick and easy solution you may want to try is to add a custom type extension to your parse call.

Example via CLI:

mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend lang="html|lang"

Result

{
  "title": "Mercury Goes Open Source! — Postlight — Digital Product Studio",
  "author": "Adam Pash",
  "date_published": "2019-02-06T14:36:45.000Z",
  "dek": null,
  "lead_image_url": "https://postlight.com/wp-content/uploads/2019/02/mercury-open-source-social-card-e1550670446269.png",
  "content": "...content",
  "next_page_url": null,
  "url": "https://postlight.com/trackchanges/mercury-goes-open-source",
  "domain": "postlight.com",
  "excerpt": "It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser. Written in JavaScript and running on both Node and in the ...",
  "word_count": 436,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1,
  "lang": "en-US"
}

See https://github.com/postlight/mercury-parser/tree/master/src/extractors/custom#custom-types for more information on adding additional fields to the response.
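The --extend argument in the example reads as name=selector|attribute: the field is called lang, the CSS selector is html, and the part after the pipe names the attribute to read. The sketch below only illustrates how such a spec decomposes. parseExtendSpec is a hypothetical helper, and Mercury's own handling of the flag may differ.

```javascript
// Hypothetical helper for illustration only; Mercury's own parsing of the
// --extend argument may differ.
function parseExtendSpec(spec) {
  // Split at the first "=" so selectors such as a[rel=next] keep their own "=".
  const eq = spec.indexOf('=');
  if (eq <= 0) throw new Error(`Expected name=selector, got: ${spec}`);
  const name = spec.slice(0, eq);
  const rest = spec.slice(eq + 1);

  // An optional "|attr" suffix names the attribute to read.
  // Simplifying assumption: the selector itself contains no "|".
  const bar = rest.lastIndexOf('|');
  const selector = bar === -1 ? rest : rest.slice(0, bar);
  const attr = bar === -1 ? null : rest.slice(bar + 1);
  return { name, selector, attr };
}

// The shell strips the quotes, so the CLI receives lang=html|lang.
console.log(parseExtendSpec('lang=html|lang'));
// → { name: 'lang', selector: 'html', attr: 'lang' }
```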

mtashley, Sep 20 '19 16:09

Is there a way to do this without writing a custom extractor (or using the CLI)? It would feel a little silly to have to re-parse the whole page just to get the language ...

solarkraft, Jan 29 '22 04:01