node-readability
Special Chars getting butchered
Is there a way to specify the encoding? Or does it already default to something during parsing? I'm using node-readability to parse lots and lots of blog pages (from a list of permalinks), but I'm getting lots of special chars (all manner of quotes, hyphens, etc.) that are turning into messy jumbles (e.g. ’).
The HTML parser accepts the HTML source as a string, which means it doesn't need to worry about text encoding: a JavaScript string is already decoded text, so whatever produced the string is expected to have decoded the bytes with the right charset (usually UTF-8). Could you please provide a test case to reproduce the issue?
If you need to deal with data with encodings other than utf-8, check out node-iconv https://github.com/bnoordhuis/node-iconv
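To illustrate the point about decoding before parsing, here is a minimal sketch using only Node's built-in `Buffer`. The helper name `decodeBody` is hypothetical and not part of node-readability; it only handles the two charsets `Buffer` supports natively, and it assumes you already know the charset the page declares. Anything else would need a converter such as node-iconv.

```javascript
// Hypothetical helper (not part of node-readability): turn raw response
// bytes into a JavaScript string using the charset the page declares.
function decodeBody(bytes, charset) {
  const name = String(charset || 'utf-8').toLowerCase();
  if (name === 'utf-8' || name === 'utf8') {
    return bytes.toString('utf8');
  }
  if (name === 'iso-8859-1' || name === 'latin1') {
    return bytes.toString('latin1');
  }
  // Anything else needs a real converter such as node-iconv.
  throw new Error('Unsupported charset: ' + charset);
}

// Simulated response body: UTF-8 bytes containing a curly apostrophe.
const body = Buffer.from('It’s here', 'utf8');

const good = decodeBody(body, 'utf-8');      // decoded with the right charset
const bad = decodeBody(body, 'iso-8859-1');  // same bytes, wrong charset

console.log(good === 'It’s here'); // true
console.log(bad === 'It’s here');  // false: the apostrophe became three junk characters
```

The same bytes give clean text or a jumble depending only on the charset used to decode them, so the fix belongs at the point where the response bytes become a string, before the string is handed to the parser.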
I actually stepped back a level and handled all the encodings after doing the readability parsing, since I need to pick out certain elements anyway. Thanks for the answer and the explanation though.
Can you post an example? I'm seeing this too.
I'm seeing an issue with HTML entities. Using the clean-proxy example, try to open this article: http://www.openforum.com/idea-hub/topics/marketing/article/what-we-can-learn-from-justin-bieber-guy-kawasaki
” converts to ” .. and similarly other entities too.