Unable to read individual articles for Atom and RSS 1.0 feeds
nom successfully lists the feed items, but attempting to read an individual article only shows the title and date:
```
Moscow on the Med: A Faraway War Transforms a Turkish Resort Town
2022-12-29 10:00:26 +0000 UTC
```
Tested with https://rss.nytimes.com/services/xml/rss/nyt/World.xml
It doesn't work for this RSS v1.0 feed either: http://feeds.bbci.co.uk/news/rss.xml
Hmm, this is an interesting case. These feeds are just links to the articles without containing the content. We could attempt to fetch the content from the URL, but the markdown conversion from full-page HTML is likely to be funky.
One option here would be to just open the links that have no content in the browser using xdg-open or similar.
Would that suit your use case? I can try and look at parsing, but this opens up a large can of worms; even the Times site in your example requires a Times account to actually get the content.
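A minimal sketch of what that fallback could look like in Go (the `openInBrowser` helper is hypothetical, not nom's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// openInBrowser is a hypothetical helper: it hands a feed item's link
// to the platform's default opener when the feed carries no content.
func openInBrowser(url string) error {
	var cmd *exec.Cmd
	switch runtime.GOOS {
	case "darwin":
		cmd = exec.Command("open", url)
	case "windows":
		cmd = exec.Command("rundll32", "url.dll,FileProtocolHandler", url)
	default: // linux and the BSDs
		cmd = exec.Command("xdg-open", url)
	}
	return cmd.Start()
}

func main() {
	if err := openInBrowser("https://www.nytimes.com"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```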
Hi @guyfedwards
> We could attempt to fetch the content from the URL, but the markdown conversion from full-page HTML is likely to be funky.
Yes, it might be a bit of a challenge. You'll want to find a library to strip `script`, `style`, and `noscript` tags, and another to convert HTML to markdown. I see you're already making use of html-to-markdown.
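A sketch of that pipeline, assuming goquery for the tag stripping alongside the html-to-markdown converter already in use; the wiring here is illustrative, not nom's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

// fetchAsMarkdown fetches a page, drops script/style/noscript nodes,
// and converts what remains to markdown.
func fetchAsMarkdown(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}
	// Remove non-content subtrees before conversion.
	doc.Find("script, style, noscript").Remove()

	html, err := doc.Html()
	if err != nil {
		return "", err
	}

	converter := md.NewConverter("", true, nil)
	return converter.ConvertString(html)
}

func main() {
	out, err := fetchAsMarkdown("https://example.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(strings.TrimSpace(out))
}
```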
> One option here would be to just open the links that have no content in the browser using xdg-open or similar. Would that suit your use case?
Not quite. I'm interested in reading the article in the terminal via the RSS reader without leaving the app.
> ...even the Times site in your example requires a Times account to actually get the content.
I'm able to get the article content via `w3m https://www.nytimes.com/live/2022/12/29/world/russia-ukraine-news`. Does this work for you?
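For reference, `w3m -dump` renders a page to plain text non-interactively, so a terminal reader could in principle shell out to it (a sketch, assuming `w3m` is on the PATH; the helper name is made up):

```go
package main

import (
	"fmt"
	"os/exec"
)

// renderWithW3m is a hypothetical helper that shells out to
// `w3m -dump`, which prints the rendered page as plain text.
func renderWithW3m(url string) (string, error) {
	out, err := exec.Command("w3m", "-dump", url).Output()
	return string(out), err
}

func main() {
	text, err := renderWithW3m("https://example.com")
	if err != nil {
		fmt.Println("w3m failed:", err)
		return
	}
	fmt.Println(text)
}
```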
Well, `w3m` is a complete browser, so of course it displays the article; it renders practically all of the page's elements.
Fetching ONLY an article from a web page is a little tricky. I'm not sure there is any good way of scraping just the article content from HTML, stripping out headers, navigation, sidebars, footers, etc., since the structure of a web page containing an article isn't standardized. In fact, I could build a whole page with any of those blocks out of styled `div` elements (not even classes). Not only that, many pages contain skeletons for things like modal windows, or hidden content that becomes visible on click (or other user interaction with the interface), etc. Some will also show the full content only with a subscription.
Coming back to how to fetch article content from a webpage: one heuristic I can think of is finding the element with the highest word density (after all the tags are stripped), maybe?
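A rough sketch of that heuristic, using golang.org/x/net/html and scoring candidates by raw text length as a crude stand-in for word density (illustrative only, and quadratic, so fine for a sketch but not production):

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// textLen sums the length of all text under n, skipping
// script/style/noscript subtrees.
func textLen(n *html.Node) int {
	if n.Type == html.TextNode {
		return len(strings.TrimSpace(n.Data))
	}
	if n.Type == html.ElementNode {
		switch n.Data {
		case "script", "style", "noscript":
			return 0
		}
	}
	total := 0
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		total += textLen(c)
	}
	return total
}

// densestBlock walks the tree and returns the block-level candidate
// carrying the most text.
func densestBlock(n *html.Node, best *html.Node) *html.Node {
	if n.Type == html.ElementNode {
		switch n.Data {
		case "article", "main", "section", "div", "p":
			if best == nil || textLen(n) > textLen(best) {
				best = n
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		best = densestBlock(c, best)
	}
	return best
}

func main() {
	doc, err := html.Parse(strings.NewReader(
		`<html><body><div>nav</div><article><p>The actual story text lives here.</p></article></body></html>`))
	if err != nil {
		panic(err)
	}
	if best := densestBlock(doc, nil); best != nil {
		fmt.Printf("densest element: <%s>, %d chars of text\n", best.Data, textLen(best))
	}
}
```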
All in all, this is quite a hefty task. If something like this were implemented, that'd be great. I personally have some feeds like this in my newsboat config, and they are displayed with only a link to the article.
I think that short term, opening the link is a sufficient solution; longer term we can look at adding HTML parsing capability, but that will be a bit more of a challenge.
The circumflex program allows you to read Hacker News articles in the terminal. It accomplishes this by using the Go-Readability package to find the main readable content and the metadata from an HTML page. After that comes some post-processing to turn it into markdown that can be displayed in the terminal (relevant code). My understanding is that this is not 100% reliable, but anecdotally it seems to work for most articles that get posted on HN.
Maybe a similar approach would work here?
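For a sense of what that looks like in Go, a minimal sketch using go-shiori/go-readability, which is presumably the readability port circumflex builds on (assuming its `FromURL` helper; the 30-second timeout is arbitrary):

```go
package main

import (
	"fmt"
	"time"

	readability "github.com/go-shiori/go-readability"
)

func main() {
	// FromURL fetches the page and extracts the main readable content,
	// similar to a browser's reader view.
	article, err := readability.FromURL("https://example.com/some-article", 30*time.Second)
	if err != nil {
		fmt.Println("extraction failed:", err)
		return
	}

	fmt.Println("Title:", article.Title)
	// article.Content holds the cleaned HTML, which could then be fed
	// through html-to-markdown for display; article.TextContent is the
	// plain-text version.
	fmt.Println(article.TextContent)
}
```

Since `Content` keeps the cleaned HTML, the existing html-to-markdown step could run on that instead of the full page.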