
links that read-art can not crawl

Open Tjatse opened this issue 11 years ago • 5 comments

Tjatse avatar Nov 28 '14 07:11 Tjatse

Hi!

I'm using your module in my web crawler, Web page Content Extractor (wce), and I recently discovered that read-art returns "Error: 400 Bad Request" for the URLs below, whereas node-readability handles them without any problem. Could you please check them?

  • http://rss.feedsportal.com/c/33832/f/610117/p/1/s/64865903/sc/3/l/0L0Slongfordleader0Bie0Clife0Etimes0Eclassic0Eirish0Edesign0Ereimagined0Ein0Estyle0E10E6965490A/story01.htm
  • http://smh.com.au/sport/cycling/australian-cyclist-rory-sutherland-pulls-out-of-world-titles-20150920-gjqzur.html

mxr576 avatar Sep 20 '15 16:09 mxr576

Hi @mxr576, thanks a lot. There was a bug in how req-fast set the Host header. I've fixed it and added your issue as a test case under the test directory. It works fine now; just update read-art to the latest version and try again.
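
The req-fast patch itself is not shown in this thread. As a rough illustration only, the sketch below shows one way a client can keep the Host header in step with the URL it is actually requesting, for example on each hop of a redirect chain. That a stale Host value was the cause here is an assumption, and `hostHeaderFor` and `headersForHop` are hypothetical names, not part of req-fast or read-art.

```javascript
// Hypothetical helper (not from req-fast): derive the Host header value
// from the URL that is actually being requested.
function hostHeaderFor(targetUrl) {
  // URL#host keeps a non-default port and drops a default one (80 for http).
  const { host } = new URL(targetUrl);
  return host;
}

// Hypothetical helper: build the headers for one request in a redirect chain.
function headersForHop(targetUrl, baseHeaders = {}) {
  const headers = {};
  // Copy everything except a Host value carried over from an earlier hop,
  // whatever letter case it was written in.
  for (const [name, value] of Object.entries(baseHeaders)) {
    if (name.toLowerCase() !== 'host') headers[name] = value;
  }
  headers.Host = hostHeaderFor(targetUrl);
  return headers;
}
```

Calling `headersForHop('http://smh.com.au/sport/', { host: 'rss.feedsportal.com' })` would yield a single `Host` entry set to `smh.com.au`, so the header cannot disagree with the server being contacted.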

Tjatse avatar Sep 21 '15 03:09 Tjatse

Thanks for the fast reaction! I also suspected that this was a req-fast issue. I can confirm that content extraction now works fine on these links with read-art.

mxr576 avatar Sep 21 '15 05:09 mxr576

@Tjatse, for the URL http://mp.weixin.qq.com/s?__biz=MjYyMzc1Mjk4MA==&mid=400815255&idx=1&sn=d91b630394b8ba70209406bbf44b41e8&scene=0#wechat_redirect, where the article consists of pictures, the result is:

<div> <strong class="profile_nickname">搞笑集中营</strong>
<p class="profile_meta"> <span class="profile_meta_value">WeiGaoXiao</span> </p>
<p class="profile_meta"> <span class="profile_meta_value">搞笑段子、搞笑视频、搞笑幽默、搞笑糗事、内涵漫画……等等搞笑的搞笑,这里是搞笑集中营,一网打尽所有的搞笑,让你天天笑哈哈哈哈哈哈哈~</span>

entertainyou avatar Feb 18 '16 09:02 entertainyou

https://medium.com/google-developers/drawing-a-rounded-corner-background-on-text-5a610a95af5 The entire article is not crawled.

FarmaanElahi avatar Aug 14 '18 08:08 FarmaanElahi