mercury-parser-api
Handle full-text content in other languages
The result of retrieving a non-English webpage is not decoded properly. The API returns strings of hex digits (e.g. "δΈζ°η½") instead of readable text. Is there a way to fix this? I tried the CLI version of Mercury Parser and passed the parameter `--format markdown`, which produced the correct text. However, I have no idea how to pass this kind of parameter when calling mercury-parser-api. Please try the example URLs below to reproduce the problem:
- https://news.sina.com.cn/c/2021-01-23/doc-ikftssan9988691.shtml
- http://www.chinanews.com/sh/2021/01-24/9395190.shtml
Not sure if it's due to encoding. Unfortunately, I do not have time to investigate this now. See https://github.com/postlight/mercury-parser/issues/425
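In the meantime, a possible client-side workaround, sketched under one assumption: that the "hex digits" in the returned content are HTML numeric character references (for example `&#x4E2D;`) rather than a genuine charset mismatch. If that holds, they can be decoded after the response arrives. The helper name `decode_entities` and the sample string are illustrative only and are not part of mercury-parser-api.

```python
import html


def decode_entities(text: str) -> str:
    """Replace HTML character references (e.g. '&#x4E2D;') with the characters they encode."""
    # html.unescape is part of the Python standard library and handles
    # both numeric (&#x...; / &#...;) and named (&amp;) references.
    return html.unescape(text)


# Hypothetical sample of what the returned content might contain.
sample = "&#x4E2D;&#x65B0;&#x7F51;"
print(decode_entities(sample))
```

Note that this also unescapes references such as `&lt;` and `&amp;`, so it is safer to apply it to extracted text than to HTML markup that will be rendered again.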