node-readability Special Chars getting butchered

Open erskingardner opened this issue 14 years ago • 4 comments

Is there a way to specify the encoding? Or does it already default to something during parsing? I'm using node-readability to parse lots and lots of blog pages (from a list of permalinks), but I'm getting lots of special chars (all manner of quotes, hyphens, etc.) that are turning into messy jumbles (e.g. â€™).

Dec 14 '10 12:12 erskingardner

The HTML parser accepts the HTML source as a string, which means it doesn't need to worry about text encoding, because a JavaScript string is expected to contain valid UTF-8 data. Could you please provide a test case to reproduce the issue?

If you need to deal with data with encodings other than utf-8, check out node-iconv https://github.com/bnoordhuis/node-iconv
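For anyone landing here later, a minimal sketch of what is probably going on, using only Node's built-in `Buffer` rather than node-iconv. The helper names `decodeBody` and `repairLatin1Mojibake` are made up for illustration and are not part of node-readability. The assumption is that the page bytes are UTF-8 but get decoded with a single-byte charset before the string reaches the parser. Node's `latin1` turns bytes 0x80 and 0x99 into control characters, while Windows-1252 renders them as € and ™, which matches the â€™ shape from the report.

```javascript
// Sketch: a UTF-8 right single quote (U+2019) becomes a three-character
// jumble when its bytes are decoded with a single-byte charset.

// Hypothetical helper: decode raw response bytes with an explicit charset
// before handing the resulting string to the HTML parser.
function decodeBody(bytes, charset) {
  return Buffer.from(bytes).toString(charset);
}

// Hypothetical helper: undo a latin1 mis-decode of UTF-8 bytes.
// This works because latin1 maps every byte to exactly one character.
function repairLatin1Mojibake(text) {
  return Buffer.from(text, 'latin1').toString('utf8');
}

const raw = Buffer.from('It\u2019s', 'utf8'); // bytes: 49 74 e2 80 99 73
const good = decodeBody(raw, 'utf8');         // decoded with the right charset
const bad = decodeBody(raw, 'latin1');        // decoded with the wrong charset
const repaired = repairLatin1Mojibake(bad);   // wrong decode reversed

console.log(JSON.stringify(good), JSON.stringify(bad), JSON.stringify(repaired));
```

If the source page really is in a non-UTF-8 encoding that `Buffer` does not support, converting the bytes with something like node-iconv before calling the parser would be the equivalent step.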

Dec 15 '10 07:12 arrix

I actually stepped back a level and handled all the encodings after I do the readability parsing since I need to pick out certain elements anyways. Thanks for the answer and the explanation though.

Dec 15 '10 07:12 erskingardner

Can you post an example? I'm seeing this too.

Feb 12 '11 06:02 darkhelmet

I'm seeing an issue with htmlentities. Using the clean-proxy example, try to open this article: http://www.openforum.com/idea-hub/topics/marketing/article/what-we-can-learn-from-justin-bieber-guy-kawasaki

” converts to ” .. and similarly other entities too.

Feb 16 '11 13:02 adeelraza
