Invalid decoded text

Open farmaan-appachhi opened this issue 5 years ago • 8 comments

  • Platform: Mac
  • Mercury Parser Version: 2.1.0
  • Node Version (if a Node bug): 10
  • Browser Version (if a browser bug):

Expected Behavior

Parsed HTML should be properly encoded as per the original text

Current Behavior

The parsed HTML contains invalid text. This might be caused by a decoding issue.

Steps to Reproduce

  • Fetch the html using any client
  • Pass it to the parser using Mercury.parse(url, {html: fetchedHtml})
  • Returned HTML contains incorrectly decoded text

Some Links: https://www.newyorker.com/culture/the-new-yorker-interview/daenerys-tells-all-game-of-thrones-finale-emilia-clarke-beyonce

Detailed Description

I want to fetch the HTML myself and hand it to the parser, instead of having the parser fetch it.

Possible Solution

After looking at the code, it seems you are handling this case for the browser only, i.e. the proper encoding is read from the HTML file only if the HTML is provided from the browser. Ideally the parser should be able to decode the text whether or not it is running in a browser.

farmaan-appachhi avatar May 24 '19 19:05 farmaan-appachhi

Screenshot for the parsed html

image

farmaan-appachhi avatar May 24 '19 19:05 farmaan-appachhi

Fixed it by passing the HTML as a UTF-8 Buffer instead of the string mentioned in the README.

farmaan-appachhi avatar May 24 '19 20:05 farmaan-appachhi

Hi @farmaan-appachhi Could you please provide an example?

I am facing pretty much the same issue.

grigoriy-didorenko avatar Jul 30 '19 12:07 grigoriy-didorenko

For me, the problem occurred when I passed the local HTML as a string. Using a Buffer fixed the issue.

// Pass the HTML as a Buffer rather than a string
Mercury.parse(url, {
    html: Buffer.from(html, "utf-8"),
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) " +
            "Chrome/60.0.3112.113 Safari/537.36"
    },
})
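
For HTML stored in a local file, a minimal sketch of the same idea: `fs.readFileSync` returns a Buffer when no encoding argument is given, so the raw bytes never pass through a string. The helper name `readHtmlAsBuffer` and the file path are hypothetical, and the `Mercury.parse` call is left as a comment so the snippet runs without the package installed.

```javascript
const fs = require('fs');

// Hypothetical helper: read an HTML file as raw bytes.
// Omitting the encoding argument makes readFileSync return a Buffer, not a string.
function readHtmlAsBuffer(filePath) {
  return fs.readFileSync(filePath);
}

// Usage, following the call shown in this thread (url and the path are placeholders):
// const html = readHtmlAsBuffer('./article.html');
// Mercury.parse(url, { html });
```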

FarmaanElahi avatar Jul 30 '19 12:07 FarmaanElahi

@farmaan-appachhi That saved my day (at least half), thank you

Could you provide the link where you found that?

grigoriy-didorenko avatar Jul 30 '19 13:07 grigoriy-didorenko

I tried debugging the code. It took me 5-6 hours to figure out the issue. If you look at the source code, they use a Buffer when fetching the HTML, but locally provided HTML is used as a plain string. That's how I figured it out.
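
To illustrate why a string and a Buffer can behave differently, here is a small sketch using only Node's built-in Buffer. It is an assumption, not something confirmed in this thread, that the parser ends up decoding the provided HTML as bytes. The snippet only shows that once a string's characters are treated as single bytes and then decoded as UTF-8, non-ASCII text is damaged, while real UTF-8 bytes survive.

```javascript
const original = 'café “quoted”';

// Real UTF-8 bytes: decoding them as UTF-8 gives the text back unchanged.
const utf8Bytes = Buffer.from(original, 'utf8');
const roundTripped = utf8Bytes.toString('utf8');

// One byte per character (latin1): 'é' becomes the single byte 0xE9 and the
// curly quotes lose their high byte. Decoding those bytes as UTF-8 no longer
// yields the original text.
const singleBytes = Buffer.from(original, 'latin1');
const garbled = singleBytes.toString('utf8');

console.log(roundTripped === original); // true
console.log(garbled === original);      // false
```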

FarmaanElahi avatar Jul 30 '19 13:07 FarmaanElahi

I can verify that using Buffer works. This should have been mentioned in the README.

csotiriou avatar Aug 09 '19 06:08 csotiriou

+1, using a Buffer works. It seems that a string should never be passed in, or Mercury should detect the type and handle it specially.
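
The type detection suggested here could look roughly like the sketch below. `toHtmlBuffer` is a hypothetical helper, not part of Mercury's API: it leaves a Buffer untouched and encodes a string as UTF-8, so the caller always hands the parser bytes.

```javascript
// Hypothetical helper (not part of Mercury): always return a Buffer.
function toHtmlBuffer(html) {
  if (Buffer.isBuffer(html)) {
    return html; // already raw bytes, pass through unchanged
  }
  if (typeof html === 'string') {
    return Buffer.from(html, 'utf-8'); // same conversion as the workaround in this thread
  }
  throw new TypeError('html must be a string or a Buffer');
}

// Usage, with the call shape used in this thread:
// Mercury.parse(url, { html: toHtmlBuffer(html) });
```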

ttimasdf avatar Sep 23 '20 10:09 ttimasdf