
Some sites that don't work well: Medium, Al Jazeera

Open · OKNoah opened this issue · 15 comments

Medium regularly retrieves no images, and sometimes the article is cut off near the end.

e.g.: https://medium.com/@erikdkennedy/7-rules-for-creating-gorgeous-ui-part-1-559d4e805cda

Al Jazeera doesn't get the title.

e.g.: http://www.aljazeera.com/news/2015/03/isil-fighters-bulldoze-ancient-assyrian-palace-iraq-150305195222805.html

OKNoah avatar Mar 06 '15 05:03 OKNoah

First off, thanks @luin for this awesome library. I've tried a bunch of others and this has been the best 👍.

I'm also finding some sites that don't work well. I can confirm that images are still broken with the above Medium link and some titles don't work.

I'm also having problems with articles from The New York Times (e.g. http://www.nytimes.com/2016/04/17/business/economy/san-francisco-housing-tech-boom-sf-barf.html).

A lot of content is cut out, and links that were hidden become visible.

I was wondering if there's any guidance for debugging broken sites and creating fixes. Would love to contribute back to this and not rely on you to debug every site that isn't perfect 😉.

Thanks again @luin !

haroldtreen avatar Apr 17 '16 20:04 haroldtreen

Also noticing that no blogspot articles work :/

e.g. http://devopsanywhere.blogspot.com/2016/04/what-if-cli-tools-had-restful-apis.html

(There are a lot of nested divs, which runs contrary to the strategy of looking for shallow blocks of content.)

haroldtreen avatar Apr 19 '16 16:04 haroldtreen

Engadget has similar issues.

I'm happy to spend some time fixing this if folks have ideas on approaches. I'm looking into it now, but haven't made too much progress thus far.

mhamann avatar Jul 13 '16 03:07 mhamann

Removing both script and noscript tags helps for Engadget, but we still only get a partial article back, because the articles (though they sit within an article tag block) contain deeply nested article blocks.

I think the solution in this case (and a few others) is to figure out the issue where pieces of articles are not contained "shallowly" within one element.
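
A minimal sketch of that stripping step using Cheerio (Cheerio here is my choice for illustration, not part of the library; the assumption is that you pre-process the raw HTML yourself before handing it to readability):

```js
const cheerio = require('cheerio');

// Strip script/noscript blocks so their contents can't be
// mistaken for article text by the content-scoring pass.
function stripScripts(html) {
  const $ = cheerio.load(html);
  $('script, noscript').remove();
  return $.html();
}
```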

mhamann avatar Jul 13 '16 04:07 mhamann

@mhamann

I've been having the same issues with some sites. My approach was to wrap readability with a Cheerio-based pre-processor.

Some things that I do with that pre-processor:

  • Remove nodes. Readability has publisher guidelines with some extraction rules for removing things: https://www.readability.com/developers/guidelines#publisher
  • Remove divs that match specific regexes.
  • Remove hidden nodes (e.g. nodes with display: none).
  • Replace with children. Basically removing intermediate nodes (things like .container) to make the DOM shallower.
  • Replace with children for divs matching a certain regex.
  • Insert missing paragraph tags. (e.g. <div>Hello World</div> -> <div><p>Hello World</p></div>)
  • Merge nodes. (e.g. <article>Hello</article><article>World</article> -> <article>Hello World</article>)

The bolded items are probably what you'll find most helpful; a rough sketch of a few of these operations follows.
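
To make it concrete, here's a minimal Cheerio sketch of some of those operations. This is not the real pre-processor; the selectors and the junk regex are placeholders you'd tune per site:

```js
const cheerio = require('cheerio');

// Minimal pre-processor sketch. Selectors and junkRe are
// placeholders, not production rules.
function preprocess(html) {
  const $ = cheerio.load(html);

  // Remove hidden nodes (e.g. style="display: none").
  $('[style*="display:none"], [style*="display: none"]').remove();

  // Remove divs whose class/id matches a junk pattern.
  const junkRe = /(comment|sidebar|share|footer)/i;
  $('div').each(function () {
    const hint = ($(this).attr('class') || '') + ' ' + ($(this).attr('id') || '');
    if (junkRe.test(hint)) {
      $(this).remove();
    }
  });

  // Replace intermediate wrappers (e.g. .container) with their
  // children to make the DOM shallower.
  $('div.container').each(function () {
    $(this).replaceWith($(this).html());
  });

  // Insert missing paragraph tags:
  // <div>Hello World</div> -> <div><p>Hello World</p></div>
  $('div').each(function () {
    if ($(this).children().length === 0 && $(this).text().trim()) {
      $(this).html('<p>' + $(this).html() + '</p>');
    }
  });

  return $.html();
}
```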

I also added a system for regression tests that you can use to quickly test different pages.

  1. Go to the broken site and let it load.
  2. Open DevTools and run copy(document.documentElement.outerHTML).
  3. Create a new file under /test/fixtures (https://github.com/luin/readability/tree/master/test/fixtures).
  4. Paste the copied DOM and save the file as <name_of_website>.html.
  5. Open /test/article-tests.js (https://github.com/luin/readability/blob/master/test/article-tests.js).
  6. There's an array of test objects (https://github.com/luin/readability/blob/master/test/article-tests.js#L9-L20). Add one for the HTML file you just created (see the example below):
  • fixture: the name of the file you created
  • title: the expected title of the page
  • include: array of strings you expect to be in the output
  • notInclude: array of strings you expect to be removed from the output
  7. Run npm test.
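
A hypothetical entry might look like this (the fixture name, title, and strings are placeholders; use values from the page you captured):

```js
{
  fixture: 'example-site', // loads test/fixtures/example-site.html
  title: 'Expected Page Title',
  include: ['A sentence that should survive extraction'],
  notInclude: ['Subscribe to our newsletter']
}
```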

That's how I test in my fork, and I find it a LOT faster than manual testing. Once that test is passing, it's also there for eternity and will make sure nobody breaks that site again.

Hope that helps :).

haroldtreen avatar Jul 13 '16 14:07 haroldtreen

Thanks, @haroldtreen! Do you have an example of your pre-processor that you wouldn't mind sharing? These are very helpful tips.

mhamann avatar Jul 14 '16 03:07 mhamann

Glad that helps @mhamann :)

These are the rules I supply: [screenshot of the rule list]

Unfortunately it's all in a private repo at the moment, but I'll be trying to transfer more fixes into readability and out of my wrapper.

haroldtreen avatar Jul 14 '16 03:07 haroldtreen

I also have a much more comprehensive set of regression tests. It would almost be good to open-source the test suite so that content extractors can be compared. There's been some work on this, but it could probably be updated...

[screenshot of regression test results]

haroldtreen avatar Jul 14 '16 03:07 haroldtreen

Very nice--would be awesome to pull more of this into the open source project, as you said.

mhamann avatar Jul 14 '16 03:07 mhamann

I'm working on a project that requires really good content extraction (https://epub.press), so that's how it's come about :).

I've been accepted into the Recurse Center in September and will have 3 months to work on open-source content extraction magic 👍. It will happen!

haroldtreen avatar Jul 14 '16 03:07 haroldtreen

I'm currently working on a project where I use node-readability and run into the same issues. To fix it, I took a very similar approach to @haroldtreen's and came up with a preprocessor that I hook into node-readability's preprocess function.

I just wanted to jump in and provide a link to my preprocessor file in case someone needs some code as a starting point for preprocessor logic. It's very early and I'm still experimenting, but Engadget, for instance, already works nicely, including the full text and the article image. Here is the proof :), if you want to have a look:

http://purecontentproxy.azurewebsites.net/document?uri=https://www.engadget.com/2016/11/11/nintendos-mini-nes-is-out-today/

Now, besides the easy stuff like removing unnecessary elements, I think the following parts are the most interesting, because it took me some time to get them working correctly with Cheerio:

  • It can merge nested DIVs into a single DIV
  • It can merge multiple DIV siblings (on the same level) into their single container DIV
  • Both of the above only support DIVs for now (as that was what Engadget and others needed), but it should be easy to also support nested ARTICLE elements, for instance
  • It can handle the lazy image loading that sites like IGN apply, where images are only loaded when the user scrolls down and carry a placeholder src value until then

As I said, it's very much a work in progress and not yet customizable like the preprocessor Harold built, but I hope someone will find it useful; a rough sketch of the lazy-image fix and the preprocess hook is below. When this gets more mature, I'm also happy to help get some preprocessor logic into this repository.
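
To give an idea, a minimal sketch of the lazy-image fix and how it could hook into node-readability's preprocess option (the data-* attribute names are assumptions; sites differ):

```js
const cheerio = require('cheerio');
const read = require('node-readability');

// Promote lazy-load placeholders to real image sources.
// data-src / data-original are common attribute names,
// but they vary per site.
function fixLazyImages(html) {
  const $ = cheerio.load(html);
  $('img').each(function () {
    const real = $(this).attr('data-src') || $(this).attr('data-original');
    if (real) {
      $(this).attr('src', real);
    }
  });
  return $.html();
}

const targetUrl = 'https://www.engadget.com/2016/11/11/nintendos-mini-nes-is-out-today/';

read(targetUrl, {
  preprocess: function (source, response, contentType, callback) {
    // Hand the rewritten HTML back to readability.
    callback(null, fixLazyImages(source));
  }
}, function (error, article, meta) {
  // article.content now contains the extracted, pre-processed article
});
```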

codekoenig avatar Nov 14 '16 01:11 codekoenig

I have been using node-readability for quite some time now. I'm having a problem with this particular site and I don't know why: http://abpnews.abplive.in/bollywood/abhishek-proud-of-aishwarya-in-ae-dil-hai-mushkil-491527/

I am getting the following output.

Access Denied

You don't have permission to access "http://abpnews.abplive.in/bollywood/abhishek-proud-of-aishwarya-in-ae-dil-hai-mushkil-491527/" on this server. Reference #18.621b07b.1479978747.cf1daa

Any help?

saivishal1996 avatar Nov 24 '16 09:11 saivishal1996

@saivishal1996 It seems they detect the User-Agent header and don't return the content of their site if they suspect it is not a browser requesting the page.

You can work around that by specifying a User-Agent header like Google Chrome's ~~in readability.js - do this in the function read~~ Edit: forget that, you can easily pass the desired headers for the HTTP request into readability:

```js
readability(
  targetUrl,
  {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
    }
  },
  function (error, article, meta) {
    // callback
  }
);
```

~~In general I think it would be a good idea for this library to send a User-Agent header with every request that mimics a request from a widely used browser. What does everybody else think about this?~~

With the possibility to pass headers in, it might be better if all users specified their own User-Agent string, deciding for themselves whether they want to play nice or pass a spoofed agent string to make more sites work.

codekoenig avatar Nov 24 '16 15:11 codekoenig

@codekoenig Thank you for the response. It worked.

saivishal1996 avatar Dec 01 '16 20:12 saivishal1996

I've open sourced the preprocessor I use on EpubPress.

You can find it here: https://github.com/haroldtreen/epub-press/blob/master/lib/content-extractor.js#L28

I find it works really well for making sites behave with readability.

Hope that helps!

haroldtreen avatar Nov 28 '17 03:11 haroldtreen