
ReadabiliPy vs Readability.js

Open kjoshi opened this issue 5 years ago • 2 comments

Apologies if this is a stupid question, since I've not had a proper read through the source of ReadabiliPy or Readability.js, but is the pure-Python implementation in ReadabiliPy intended to exactly reproduce the results of Readability.js?

In other words, should I get the exact same results when calling:

readabilipy.simple_json_from_html_string(html, use_readability=False) and
readabilipy.simple_json_from_html_string(html, use_readability=True) ?

Because for certain articles I find that ReadabiliPy gives me extra HTML elements and text that I'm not at all interested in, for example:

> import requests
> from readabilipy import simple_json_from_html_string

> url = 'https://analytics.jiscinvolve.org/wp/2019/02/12/my-algorithmic-friend-by-andrew-cormack/'
> html = requests.get(url).text
> article = simple_json_from_html_string(html, use_readability=False)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
...
{'text': 'Archives'},
 {'text': '* July 2019, * June 2019, * February 2019, * December 2018, * November 2018, ........'}
...

whereas Readability.js manages to avoid extracting all of those links in the side bar:

> article = simple_json_from_html_string(html, use_readability=True)
> article['plain_text']
...
{'text': 'If you have comments on the draft Wellbeing Analytics Code of Practice, please...'}
<end>
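
To see exactly which paragraphs differ between the two modes, something like the following could be used. It is only a rough sketch: `paragraph_texts`, `extra_paragraphs` and `compare_modes` are hypothetical helper names (not part of ReadabiliPy), and it assumes `article['plain_text']` is a list of dicts with a `'text'` key, as in the output above.

```python
def paragraph_texts(plain_text):
    """Collect the 'text' field of each entry in an article['plain_text'] list."""
    # Treat a missing plain_text (None) as an empty list.
    return [entry["text"] for entry in (plain_text or []) if entry.get("text")]


def extra_paragraphs(pure_python, readability_js):
    """Return texts found in the pure-Python output but not in the Readability.js output."""
    seen = set(paragraph_texts(readability_js))
    return [text for text in paragraph_texts(pure_python) if text not in seen]


def compare_modes(html):
    """Run both extraction modes on the same HTML and list the extra paragraphs."""
    # Imported here so the helpers above work without readabilipy installed.
    from readabilipy import simple_json_from_html_string

    pure = simple_json_from_html_string(html, use_readability=False)
    js = simple_json_from_html_string(html, use_readability=True)  # needs Node.js
    return extra_paragraphs(pure["plain_text"], js["plain_text"])
```

Calling `compare_modes(html)` on the page above should then list the sidebar entries (such as 'Archives') that only the pure-Python mode returns.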

Is there anything I can do to get ReadabiliPy to give me results more like Readability.js? I'd like to use ReadabiliPy inside an AWS Lambda function and would like to avoid using both Node and Python (if that's even possible in a single function).

Thanks

(Hi @jemrobinson - small world..!)

kjoshi • Jul 31 '19 17:07

Hi @kjoshi!

No, it's not meant to be identical.

The original idea was that this would just be a Python wrapper around Readability.js, and you can still use it as that if you want to. However, we found that Readability.js sometimes gives HTML that doesn't strictly adhere to the standard (although it renders in browsers without issue). The downstream application that we're using this package for cares more about standards compliance, so we focused on that.

We are (were?) planning to work on getting them to be feature equivalent (if not completely identical) but we haven't got much budget for that at the moment.

I think that Readability.js uses some complex heuristics to decide which part of the page to pull out as the main content element and we haven't had a chance to look into that. If you're interested in doing so, you can try diving into the JavaScript to work out what it's doing...
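
As a rough illustration of the kind of heuristic involved: as far as I'm aware, one signal Readability.js looks at is link density, i.e. how much of an element's text sits inside links. The sketch below is not a port of the Readability.js code, just a minimal standard-library approximation of that one idea; `link_density` and `_LinkDensityParser` are hypothetical names.

```python
from html.parser import HTMLParser


class _LinkDensityParser(HTMLParser):
    """Count text characters overall and those nested inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth > 0:
            self.link_chars += length


def link_density(html_fragment):
    """Return the fraction of text in html_fragment that is link text (0.0 to 1.0)."""
    parser = _LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()  # flush any buffered text
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A sidebar list of archive links would score close to 1.0 here, while ordinary article paragraphs would score near 0.0, which is one plausible reason such blocks get dropped.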

PS. Whereabouts are you working these days?

jemrobinson • Aug 01 '19 17:08

Ok, great, thanks for confirming.

I had a quick look at the Readability.js code, but it was a bit more complicated than I assumed it would be. I don't have enough time to go through it in detail at the moment, so I'm just going to stick with your ReadabiliPy wrapper for now.

PS. I'm currently a Data Science Developer at Jisc - still based in Manchester

kjoshi • Aug 14 '19 08:08