readability
Some sites that don't work well: Medium, Al Jazeera
Medium regularly retrieves no images, and sometimes the article is cut off near the end.
e.g.: https://medium.com/@erikdkennedy/7-rules-for-creating-gorgeous-ui-part-1-559d4e805cda
Al Jazeera doesn't get the title.
e.g.: http://www.aljazeera.com/news/2015/03/isil-fighters-bulldoze-ancient-assyrian-palace-iraq-150305195222805.html
First off, thanks @luin for this awesome library. I've tried a bunch of others and this has been the best 👍.
I'm also finding some sites that don't work well. I can confirm that images are still broken with the above Medium link and some titles don't work.
I'm also having problems with articles from the New York Times (e.g. http://www.nytimes.com/2016/04/17/business/economy/san-francisco-housing-tech-boom-sf-barf.html).
A lot of content is cut out, and links that were hidden become shown.
I was wondering if there's any guidance for debugging broken sites and creating fixes. Would love to contribute back to this and not rely on you to debug every site that isn't perfect 😉.
Thanks again @luin !
Also noticing that no Blogspot articles work :/
e.g. http://devopsanywhere.blogspot.com/2016/04/what-if-cli-tools-had-restful-apis.html
(There are a lot of nested divs, which works against the heuristic of looking for shallow blocks of content.)
Engadget has similar issues.
I'm happy to spend some time fixing this if folks have ideas on approaches. I'm looking into it now, but haven't made too much progress thus far.
Removing both `script` and `noscript` tags helps for Engadget, but we still only get a partial article back, because articles (though they sit within an `article` tag block) have deeply nested article blocks.
I think the solution in this case (and a few others) is to figure out the issue where pieces of articles are not contained "shallowly" within one element.
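For the `script`/`noscript` removal mentioned above, here is a minimal, dependency-free sketch (the function name and regex approach are mine, not from the library; a real pre-processor should use a DOM parser like Cheerio rather than regexes):

```javascript
// Remove <script> and <noscript> blocks from raw HTML before handing it
// to readability. Regex-based for brevity only.
function stripScriptTags(html) {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, '')
    .replace(/<noscript\b[\s\S]*?<\/noscript>/gi, '');
}

const input = '<article><p>Hi</p><script>track();</script><noscript>enable JS</noscript></article>';
console.log(stripScriptTags(input)); // -> '<article><p>Hi</p></article>'
```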
@mhamann
I've been having the same issues with some sites. My approach was to wrap `readability` with a Cheerio-based pre-processor.
Some things that I do with that pre-processor:
- Remove nodes. Readability's publisher guidelines include some extraction rules for removing things: https://www.readability.com/developers/guidelines#publisher
- Remove divs that match specific regexes.
- Remove hidden nodes (e.g. nodes with `display: none`).
- **Replace with children.** Basically removing intermediate nodes (things like `.container`) and making the DOM more shallow.
- Replace with children for divs matching a certain regex.
- **Insert missing paragraph tags.** (e.g. `<div>Hello World</div>` → `<div><p>Hello World</p></div>`)
- **Merge nodes.** (e.g. `<article>Hello</article><article>World</article>` → `<article>Hello World</article>`)
Bold things are what you might find most helpful.
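Two of the transformations above can be sketched without Cheerio, just to show the idea (function names are mine, and these regexes assume simple, non-nested markup; the real pre-processor works on a parsed DOM):

```javascript
// Remove nodes hidden with an inline display:none style.
function removeHiddenNodes(html) {
  return html.replace(/<(\w+)[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>[\s\S]*?<\/\1>/gi, '');
}

// "Merge nodes": collapse sibling <article> blocks into one.
function mergeArticles(html) {
  return html.replace(/<\/article>\s*<article[^>]*>/gi, ' ');
}

console.log(removeHiddenNodes('<p>keep</p><div style="display: none">ad</div>'));
// -> '<p>keep</p>'
console.log(mergeArticles('<article>Hello</article><article>World</article>'));
// -> '<article>Hello World</article>'
```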
I also added a system for regression tests that you can use to quickly test different pages.
- Go to the broken site. Let it load.
- Open DevTools and run `copy(document.documentElement.outerHTML)`.
- Create a new file under `/test/fixtures` (https://github.com/luin/readability/tree/master/test/fixtures). Paste the copied DOM and save the file as `<name_of_website>.html`.
- Open `/test/article-tests.js` (https://github.com/luin/readability/blob/master/test/article-tests.js). There's an array of test objects (https://github.com/luin/readability/blob/master/test/article-tests.js#L9-L20). Add one for the HTML file you just created:
  - `fixture`: set to the name of the file you created
  - `title`: the expected title of the page
  - `include`: array of strings you expect to be in the output
  - `notInclude`: array of strings you expect to be removed from the output
- Run `npm test`.
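Putting those fields together, a test entry might look something like this (the file name, title, and strings here are made up for illustration; check `article-tests.js` for the exact shape it expects):

```javascript
// Hypothetical entry for the test array in /test/article-tests.js,
// using the fields described above. All values are placeholders.
const testCase = {
  fixture: 'example-site.html',          // file saved under /test/fixtures
  title: 'Expected Article Title',       // expected page title
  include: ['a sentence from the body'], // strings expected in the output
  notInclude: ['Share on Facebook']      // strings expected to be removed
};
console.log(testCase.fixture);
```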
That's how I test in my fork, and I find it A LOT faster than manual testing. Once that test is passing, it's also there for eternity and will make sure nobody breaks that site again.
Hope that helps :).
Thanks, @haroldtreen! Do you have an example of your pre-processor that you wouldn't mind sharing? These are very helpful tips.
Glad that helps @mhamann :)
These are the rules I supply:
Unfortunately it's all in a private repo at the moment, but I'll be trying to transfer more fixes into readability and out of my wrapper.
I also have a much more comprehensive set of regression tests. Would almost be good to open source the test suite so that content extractors can be compared. There's been some work on this, but could probably be updated...
Very nice--would be awesome to pull more of this into the open source project, as you said.
I'm working on a project that requires really good content extraction (https://epub.press), so that's how it's come about :).
I've been accepted into the Recurse Center in September and will have 3 months to work on Open Source content extraction magic 👍. It will happen!
I'm currently working on a project too where I use node-readability and have run into the same issues. To fix them, I took a very similar approach to @haroldtreen's and came up with a preprocessor that I hook into node-readability's `preprocess` function.
I just wanted to jump in and provide a link to my preprocessor file in case someone needs some code as a starter for preprocessor logic. It's very early and I'm still experimenting, but Engadget, for instance, already works nicely, including full text and the article image. Here is the proof :), if you want to have a look:
http://purecontentproxy.azurewebsites.net/document?uri=https://www.engadget.com/2016/11/11/nintendos-mini-nes-is-out-today/
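For anyone wanting to try the same hook, here's a sketch of a `preprocess` function in the shape node-readability expects (the `(source, response, contentType, callback)` signature is taken from its README; verify against the version you use). This toy version just strips `noscript` blocks, where a real one would run a full set of Cheerio rules:

```javascript
// Minimal preprocess hook: receives the downloaded HTML source and passes
// a modified version on via the callback.
function preprocess(source, response, contentType, callback) {
  callback(null, source.replace(/<noscript\b[\s\S]*?<\/noscript>/gi, ''));
}

// Usage sketch (not run here):
//   const read = require('node-readability');
//   read(url, { preprocess: preprocess }, function (err, article, meta) { /* ... */ });

// Direct call for illustration:
preprocess('<p>Hi</p><noscript>x</noscript>', null, 'text/html', function (err, html) {
  console.log(html); // -> '<p>Hi</p>'
});
```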
Besides the easy stuff like removing unnecessary elements, I think the following parts are the most interesting, because it took me some time to get them working correctly with Cheerio:
- It can merge nested DIVs into a single DIV
- It can merge multiple DIV siblings (on the same level) into their single container DIV
- Both above only support DIVs now (as this was relevant for Engadget and others), but it should be easy to support nested ARTICLE elements, for instance, too
- It can handle the lazy image loading that sites like IGN apply, where images are only loaded when the user scrolls down and have a placeholder `src` value until then
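The lazy-image fix in the last bullet boils down to promoting the real URL into `src`. A dependency-free sketch (the `data-src` attribute name is an assumption for illustration; sites use various names like `data-original` or `data-lazy-src`, and the linked preprocessor uses Cheerio rather than regexes):

```javascript
// Replace a placeholder src with the real URL carried in data-src,
// so the extracted article keeps its images.
function fixLazyImages(html) {
  return html.replace(/<img\b[^>]*\bdata-src="([^"]+)"[^>]*>/gi,
    function (tag, realSrc) {
      return tag
        .replace(/\ssrc="[^"]*"/i, '')                        // drop the placeholder src
        .replace(/\bdata-src="[^"]+"/i, 'src="' + realSrc + '"'); // promote the real URL
    });
}

console.log(fixLazyImages('<img class="lazy" data-src="https://example.com/a.jpg" src="placeholder.gif">'));
// -> '<img class="lazy" src="https://example.com/a.jpg">'
```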
As I said, it's heavily work in progress and not yet customizable like the preprocessor Harold built, but I hope someone will find it useful. When this gets more mature, I'm also happy to help get some preprocessor logic into this repository.
I have been using node-readability for quite some time now. I am having a problem with this particular site and I don't know why: http://abpnews.abplive.in/bollywood/abhishek-proud-of-aishwarya-in-ae-dil-hai-mushkil-491527/
I am getting the following output:

```
Access Denied
You don't have permission to access "http://abpnews.abplive.in/bollywood/abhishek-proud-of-aishwarya-in-ae-dil-hai-mushkil-491527/" on this server. Reference #18.621b07b.1479978747.cf1daa
```
Any help?
@saivishal1996 It seems they have some detection of the `User-Agent` header and do not return the content of their site if they suspect it is not a browser requesting their page.
You can work around that by specifying a `User-Agent` header like Google Chrome's ~~in `readability.js` - do this in the function `read`~~ Edit: forget that, you can easily pass the desired headers for the HTTP request into readability:
```javascript
readability(
  targetUrl,
  {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
    }
  },
  function(error, article, meta) {
    // callback
  });
```
~~In general I think it would be a good idea for this library to send a `User-Agent` header with every request that mimics a request from a widely used browser. What does everybody else think about this?~~
With the possibility to pass the headers in, it might be better if all users specify their own `User-Agent` string, deciding for themselves whether they want to play nice or send a spoofed agent string to make more sites work.
@codekoenig Thank you for the response. It worked.
I've open sourced the preprocessor I use on EpubPress.
You can find it here: https://github.com/haroldtreen/epub-press/blob/master/lib/content-extractor.js#L28
I find it works really well for making sites behave with `readability`.
Hope that helps!