readability
Strange encoding when scraping Inc.com
When scraping http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001 I get data like the string below:
<div>���z�8�0��N��Yݲ�M|��ؽ|J�i���t�^�~�P$$1�H
I�Vw�y�m�ߵ|���dW<�d�03�L'" �B�PU�*���������c֏���W���^�[�G�p�^����ݭ���Wonoo���L����p���wRO^�������{�������������YW廝T{]��߿�Y����h��w;�(v����s�?{�:�
��[q,����[�w�[|��_������n/U5��Ad�"6'
9�������0b��1�|v�]��tٯ��pf��ȲxvG.;�����Q��b'��C|U�����d�9��lZ�3��4q/ڭ���gcĢ>a� �F��H��кX0`��L�Ȝ���
g��+�{�
�uǵZM����Vau�[��
Any idea why this happens, or how to convert it to something readable?
I think this is caused by JSDOM not decompressing a compressed page.
Related to this: https://github.com/tmpvar/jsdom/issues/648
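To illustrate the suspected cause, here is a minimal sketch of what decompressing the body by hand could look like. It assumes the server sent gzip (not deflate or brotli) and that the page is UTF-8; `looksGzipped` and `decodeBody` are hypothetical helper names, not part of JSDOM or readability.

```javascript
var zlib = require('zlib');

// Gzip streams start with the two magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical helper: turn a raw response body (Buffer) into an HTML string.
// Assumption: the content is either gzip-compressed or plain, and UTF-8 encoded.
function decodeBody(buf) {
  var raw = looksGzipped(buf) ? zlib.gunzipSync(buf) : buf;
  return raw.toString('utf8');
}

// Simulate a server that sends a gzip-compressed page.
var compressed = zlib.gzipSync(Buffer.from('<p>Are you?</p>', 'utf8'));
console.log(decodeBody(compressed));
```

If the raw bytes are converted to a string without this step, you get unreadable output like the sample above.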
The result seems correct for me. The code:
var read = require('node-readability');
read('http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001',
  function(err, article, meta) {
    // Print the extracted article body
    console.log(article.content);
  });
And output:
<div class="article-body inc_editable" data-editable="true" data-editor-class="InlineTextEditor" data-label="Body" data-content-type="article" data-content-id="59175" data-fieldname="inc_clean_text"><p>I'm not a very successful entrepreneur. Are you?</p><span>...
Ah, OK. I think the problem was that I scraped the data with JSDOM and then ran readability on the resulting HTML.
Nope, I still see the results above.
I now use Needle to decompress the page first, and that works.
Yep, using Needle to get the content works better. Here is the link: https://github.com/tomas/needle
And a code example:
var needle = require('needle');
var read = require('node-readability');
var url = "http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001";
// Fetch the page with Needle first, then hand the HTML to readability
needle.get(url, function(error, response) {
  if (!error && response.statusCode === 200) {
    read(response.body, function(err, article, meta) {
      if (err) { return console.error(err); }
      var dom = article.document;
      // Wrap the extracted content in a minimal UTF-8 HTML page
      var html = '<html><head><meta charset="utf-8"><title>' + dom.title + '</title></head><body><h1>' + article.title + '</h1>' + article.content + '</body></html>';
      console.log(html);
    });
  }
});
When I use this code, I don't get absolute image URLs. I think that when response.body is passed to readability, it no longer knows the page's base URL. How can I solve this?
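One possible workaround, as a rough sketch: post-process the extracted HTML and resolve each `src` against the original page URL. This assumes readability leaves `src` attributes exactly as they appear in the page; `absolutizeSrc` is a hypothetical helper, not part of node-readability, and a regex is only a crude stand-in for a real HTML parser.

```javascript
// Hypothetical helper: rewrite src attributes in an HTML string so they are absolute.
// Assumption: attributes are quoted and preceded by whitespace (e.g. <img src="...">).
function absolutizeSrc(html, baseUrl) {
  return html.replace(/(\ssrc=)(["'])(.*?)\2/gi, function(match, attr, quote, src) {
    try {
      // The WHATWG URL constructor resolves relative references against baseUrl
      return attr + quote + new URL(src, baseUrl).href + quote;
    } catch (e) {
      return match; // leave anything unparseable untouched
    }
  });
}

var base = 'http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html';
console.log(absolutizeSrc('<img src="/img/a.png">', base));
```

In the Needle example above, you would pass `article.content` and `url` through this helper before building the final HTML string.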