readable
readable copied to clipboard
NY Times articles don't return full text
If you run an NY Times article through Readable like so, you'll notice that only 9 (in this case) paragraphs are captured, out of the 30 or so in the original article.
Clearly this is an upstream bug in readability-rs, not Readable, so why am I raising an issue here? Unfortunately it looks like readable-rs may be abandoned: there's been no activity from the maintainer since April of 2021, and there are a couple of trivial pull requests still outstanding from June and September of that year, which makes me think the maintainer is unlikely to show up again any time soon.
I think I've identified the bug in readability-rs: the value of this line is negated when it shouldn't be, or at least I don't think it should be. I haven't fully grokked the scoring algorithm, but when I remove the negation and test against a NY Times article it seems to extract the whole article as I would expect.
So I see three possible courses of action:
- Submit a pull request to readability-rs and see if it gets a response (this has the advantage that if I'm wrong and there's actually some good reason the algorithm is done this way, the maintainer will presumably know)
- Fork readability-rs and fix the bug
- Not bother with any of this since it seems to work okay on most sites and it is, after all, just a fun side project that you threw together for the heck of it. :)
I'll probably go ahead and submit the PR to readability-rs regardless, just in case it gets a response, but I thought I'd check with you here first since I only care about readability-rs insofar as it affects this project.
Nice summary.
I noticed that, too but I thought it was because of the paywall on New York Times.
Looking at the code your suspicion might be correct. Guess you can easily test it out by forking the repo and using the fork in your local checkout of readable and starting it with make dev.
Always here to help if you need someone to try it out.
Update on this: I tested with the modification as you suggested, and it does fix NY Times articles. However it now completely breaks Wikipedia articles (no text renders at all.) I may do some more digging to see if I can figure out why, but I now suspect I just didn't fully understand the scoring algorithm and there's actually a good reason why that line was negated.
Oh well. Not sure what to do in this case. We can try a different readability library maybe?
Good idea. I took a look around and found this one, it's significantly older but also looks to be higher-quality than the other one. For example it has a much more comprehensive test suite, and the code looks more comprehensible at a glance.
I tested on my fork and it seems to work well (by which I mean none of the pages I looked at are obviously broken.)
It would require a little extra work to get fully up and running: loyd/readability.rs doesn't extract page metadata, so we would need to do that somewhere else.
Nice find indeed! It's under MIT license. If you want we can create an organization and fork the repo to put it back in shape (update the dependencies, Rust 2018 edition, clippy lints, CI,....). I could make you a co-maintainer. Would you be interested? Alternatively you could fork it in your own namespace and create a PR to make it a dependency of readable. Whatever you prefer. 😉
Org sounds good, that way if I disappear in a puff of smoke you retain access to it and won't have to fork again. :stuck_out_tongue:
FWIW I did also discover Paperoni which includes its own port of mozilla-readability. It looks very full-featured but unfortunately is not exposed in library form. The author has mentioned in a couple of issues that he plans to release it as a standalone lib eventually, but the last mention of that was in late 2021 and it doesn't look like there's been any activity since then.
Just forking loyd/readability.rs seems more approachable, and I don't think it would be too hard to add metadata extraction since that's mostly just looking for a few specific meta tags.
Works for me. To be honest I didn't like the metadata handling of the current library anyway. It is too opinionated. I'd rather like the metatags (the title, mostly) to be returned unaltered. I'm also curious how loyd/readability.rs handles code blocks. The current library modifies them (indentation and all) and that's another thing I'd love to see fixed. Let's see how it goes.
Moved to org now and gave you full permissions.
Great initiative! :)
Thanks to the latest changes by @jfmontanaro, I think it's much better now and we can close this, right?