Handle full-text content in other languages

Open dawnyesky opened this issue 4 years ago • 1 comment

The result of retrieving a non-English webpage is not decoded correctly. The API returns strings of hex digits in place of text such as "中新网". Is there a way to fix this? I tried the CLI version of Mercury Parser with the parameter --format markdown, which produced the correct text, but I have no idea how to pass this kind of parameter when calling mercury-parser-api. Please try the example URLs below to reproduce the problem:

  1. https://news.sina.com.cn/c/2021-01-23/doc-ikftssan9988691.shtml
  2. http://www.chinanews.com/sh/2021/01-24/9395190.shtml

dawnyesky Jan 26 '21 02:01

Not sure if it's due to encoding; unfortunately, I do not have time to investigate this now. See https://github.com/postlight/mercury-parser/issues/425

HenryQW Jan 26 '21 16:01
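
A possible client-side workaround, offered only as a sketch: if the "hex digits" in the returned content are HTML numeric character references (an assumption, the thread does not confirm the exact form), they can be decoded after the response arrives. The function name `decode_character_references` and the sample string below are hypothetical and are not part of mercury-parser-api.

```python
import html


def decode_character_references(text: str) -> str:
    """Convert HTML character references (e.g. &#x4E2D;) back to Unicode text."""
    # html.unescape handles named, decimal, and hexadecimal references.
    return html.unescape(text)


# Hypothetical sample of escaped content; the real API output may look different.
escaped = "<p>&#x4E2D;&#x65B0;&#x7F51;</p>"
decoded = decode_character_references(escaped)
print(decoded)  # prints <p>中新网</p>
```

This only post-processes the response. It does not change how the API itself serializes content, and it would also unescape references such as `&lt;` that may be intentional in the HTML.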