Invalid decoded text
- Platform: Mac
- Mercury Parser Version: 2.1.0
- Node Version (if a Node bug): 10
- Browser Version (if a browser bug):
Expected Behavior
Parsed HTML should be properly encoded as per the original text
Current Behavior
The parsed HTML contains invalid text, possibly because of a decoding issue.
Steps to Reproduce
- Fetch the html using any client
- Pass that to the parser using
Mercury.parse(url,{html:fetchedHtml})
- Returned HTML contains incorrectly decoded text
Some Links: https://www.newyorker.com/culture/the-new-yorker-interview/daenerys-tells-all-game-of-thrones-finale-emilia-clarke-beyonce
Detailed Description
I want to fetch the HTML myself and pass it to the parser, instead of having the parser fetch it.
Possible Solution
After looking at the code, it seems you are handling this case for the browser only, i.e. the proper encoding is read from the HTML file only if the HTML is provided from the browser. Ideally the parser should decode the text correctly whether or not it is running in a browser.
Screenshot for the parsed html
Fixed it by passing the html as a Buffer with utf-8 instead of a string, as mentioned in the README.
Hi @farmaan-appachhi Could you please provide an example?
I am facing pretty much the same issue.
For me the problem appeared when I passed the local html as a string. Using a Buffer fixed the issue:
Mercury.parse(url, {
html: Buffer.from(html, "utf-8"),
headers: {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/60.0.3112.113 Safari/537.36"
},
})
@farmaan-appachhi That saved my day (at least half), thank you
Could you provide the link where you found that?
I tried debugging the code. It took me 5-6 hours to figure out the issue. If you look at the source code, they use a Buffer when fetching the html, but local files were passed in as plain strings. That's how I figured it out.
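For anyone loading the html from disk, here is a minimal sketch of the same idea. It assumes the file is saved as UTF-8, and `buildParseOptions` is a hypothetical helper name, not something from Mercury. Calling `fs.readFileSync` without an encoding returns a Buffer, so the html never gets turned into a string first.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of Mercury Parser): builds the options
// object for Mercury.parse from a local HTML file.
function buildParseOptions(filePath) {
  // No encoding argument, so readFileSync returns a Buffer, not a string.
  const html = fs.readFileSync(filePath);
  return { html };
}

// Usage, assuming Mercury has been required as in the snippet above:
//   Mercury.parse(url, buildParseOptions('./article.html'))
```

I have only checked this against the behaviour described in this thread, so treat it as a starting point.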
I can verify that using Buffer works. This should have been mentioned in the README.
+1, Buffer works. It seems a string should never be passed in; otherwise Mercury should detect the type and handle strings specially.
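Until something like that exists in Mercury itself, the type check can be done on the caller's side. This is only a sketch: `toHtmlBuffer` is a hypothetical name, and it assumes the string really holds UTF-8 text that was decoded correctly when it was read.

```javascript
// Hypothetical helper (not part of Mercury Parser): normalizes the html
// argument so the parser always receives a Buffer.
function toHtmlBuffer(html) {
  if (Buffer.isBuffer(html)) {
    // Already a Buffer, pass it through untouched.
    return html;
  }
  if (typeof html === 'string') {
    // Encode the string as UTF-8 bytes, as in the workaround above.
    return Buffer.from(html, 'utf-8');
  }
  throw new TypeError('html must be a string or a Buffer');
}

// Usage, assuming Mercury has been required as elsewhere in this thread:
//   Mercury.parse(url, { html: toHtmlBuffer(fetchedHtml) })
```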