Strange encoding when scraping Inc.com

Open OKNoah opened this issue 10 years ago • 6 comments

When scraping http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001 I get data like the string below:

<div>���z�8�0��N��Yݲ�M|��ؽ|J�i���t�^�~�P$$1�H
I�Vw�y�m�ߵ|���dW&lt;�d�03�L'" �B�PU�*���������c֏���W���^�[�G�p�^����ݭ���Wonoo���L����p���wRO^�������{�������������YW廝T{]��߿�Y����h��w;�(v����s�?{�:�
��[q,����[�w�[|��_������n/U5��Ad�"6'
9�������0b��1�|v�]��tٯ��pf��ȲxvG.;�����Q��b'��C|U�����d�9��lZ�3��4q/ڭ���gcĢ&gt;a�    �F��H��кX0`��L�Ȝ���
g��+�{�
�uǵZM����Vau�[��

Any guess why or how to convert it to something readable?

OKNoah avatar Mar 17 '15 20:03 OKNoah

I think this is caused by JSDOM not decompressing a compressed page.

Related to this: https://github.com/tmpvar/jsdom/issues/648
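
To illustrate what an undecompressed body looks like in practice, here is a minimal sketch, assuming the response is gzip-compressed and available as a Buffer. The helpers looksGzipped and decodeBody are hypothetical names used only for this illustration and are not part of JSDOM or readability; only Node's built-in zlib module is used.

```javascript
var zlib = require('zlib');

// Returns true when the buffer starts with the gzip magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical helper: turn a raw response body into a string,
// gunzipping it first if it appears to be gzip-compressed.
// Assumes the decompressed page is UTF-8.
function decodeBody(buf) {
  var raw = looksGzipped(buf) ? zlib.gunzipSync(buf) : buf;
  return raw.toString('utf8');
}

// Round trip with locally compressed data, so no network access is needed.
var compressed = zlib.gzipSync(Buffer.from('<p>hello</p>', 'utf8'));
console.log(decodeBody(compressed));
```

Reading the compressed bytes directly as text, without the gunzip step, is what produces unreadable output like the string in the first post.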

OKNoah avatar Mar 26 '15 22:03 OKNoah

The result seems correct for me. The code:

var read = require('node-readability');

read('http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001',
function(err, article, meta) {
  if (err) throw err;
  // Print the extracted article body.
  console.log(article.content);
});

And output:

<div class="article-body inc_editable" data-editable="true" data-editor-class="InlineTextEditor" data-label="Body" data-content-type="article" data-content-id="59175" data-fieldname="inc_clean_text"><p>I'm not a very successful entrepreneur.&nbsp; Are you?</p><span>...

luin avatar Mar 28 '15 15:03 luin

Ah, OK. I think the problem was that I scraped the data with JSDOM and then ran readability on the HTML.

OKNoah avatar Mar 30 '15 23:03 OKNoah

Nope, I still see the results above.

I now use Needle to decompress the page first, and that works.

OKNoah avatar Apr 06 '15 23:04 OKNoah

Yep, using Needle to fetch the content works better. Here is the link: https://github.com/tomas/needle

And a code example:

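var needle = require('needle');
var read = require('node-readability');

var url = "http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001";

// Fetch the page with Needle, then pass the response body to readability.
needle.get(url, function(error, response) {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || response.statusCode);
  }

  read(response.body, function(err, article, meta) {
    if (err) {
      return console.error('Readability failed:', err);
    }
    var dom = article.document;
    // Wrap the extracted content in a minimal UTF-8 HTML page.
    var html = '<html><head><meta charset="utf-8"><title>'+dom.title+'</title></head><body><h1>'+article.title+'</h1>'+article.content+'</body></html>';
    console.log(html);
  });
});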

chrisribe avatar Sep 01 '15 02:09 chrisribe

When I use this code, I don't get absolute image URLs. I think that when response.body is passed to readability, it doesn't know the page's base URL. How can I solve this?
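
One possible workaround, sketched here under the assumption that the extracted content is an HTML string and the original page URL is known, is to resolve each image src against that URL after extraction. The helper absolutizeImages is a hypothetical name, not part of readability or Needle, and the regex approach is only an approximation; an HTML parser would be more robust.

```javascript
// Hypothetical helper: rewrite <img src="..."> values so that relative
// paths are resolved against the URL of the page they came from.
// Uses the global WHATWG URL class available in current Node.js versions.
function absolutizeImages(html, baseUrl) {
  var pattern = /(<img\b[^>]*?\bsrc=)(["'])(.*?)\2/gi;
  return html.replace(pattern, function (match, prefix, quote, src) {
    try {
      return prefix + quote + new URL(src, baseUrl).href + quote;
    } catch (e) {
      return match; // leave values that cannot be parsed untouched
    }
  });
}

var pageUrl = 'http://www.inc.com/gene-marks/page.html';
console.log(absolutizeImages('<img src="/img/a.png" alt="x">', pageUrl));
```

In the Needle example above, this would be applied to article.content with the same url that was passed to needle.get.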

cuecusp avatar Nov 12 '16 18:11 cuecusp