Unable to read individual articles for Atom and RSS 1.0 feeds

Open yonas opened this issue 2 years ago • 6 comments

nom successfully lists the feed items, but attempting to read an individual article only shows the title and date:

   Moscow on the Med: A Faraway War Transforms a Turkish Resort Town          
                                                                              
   2022-12-29 10:00:26 +0000 UTC     

Tested with https://rss.nytimes.com/services/xml/rss/nyt/World.xml

yonas avatar Dec 29 '22 21:12 yonas

It doesn't work for this RSS v1.0 feed either: http://feeds.bbci.co.uk/news/rss.xml

yonas avatar Dec 29 '22 21:12 yonas

Hmm, this is an interesting case. These feeds are just links to the articles without containing the content. We could attempt to fetch the content from the URL, but the markdown conversion from a full-page HTML document is likely to be funky.
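To make the "links without content" case concrete, here is a minimal sketch of how such items could be detected using only the Go standard library. The type and function names (`rssItem`, `rssFeed`, `linkOnlyItems`) are hypothetical, not nom's actual code, and treating an empty `description` plus a missing `content:encoded` as "link-only" is an assumption about what counts as content.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// rssItem holds only the fields needed to decide whether an item has a body.
// (Hypothetical type for illustration.)
type rssItem struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
	// content:encoded from the RSS content module namespace.
	Encoded string `xml:"http://purl.org/rss/1.0/modules/content/ encoded"`
}

type rssFeed struct {
	Items []rssItem `xml:"channel>item"`
}

// linkOnlyItems returns the items that carry neither a description nor
// content:encoded, i.e. items that are effectively just a title and a link.
func linkOnlyItems(feedXML string) ([]rssItem, error) {
	var feed rssFeed
	if err := xml.Unmarshal([]byte(feedXML), &feed); err != nil {
		return nil, err
	}
	var out []rssItem
	for _, it := range feed.Items {
		if strings.TrimSpace(it.Description) == "" && strings.TrimSpace(it.Encoded) == "" {
			out = append(out, it)
		}
	}
	return out, nil
}

func main() {
	const sample = `<rss version="2.0"><channel>
<item><title>Link only</title><link>https://example.com/a</link></item>
<item><title>With summary</title><link>https://example.com/b</link><description>Text</description></item>
</channel></rss>`

	items, err := linkOnlyItems(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.Title, it.Link)
	}
}
```

Items flagged this way are the ones that would need either a fetch of the linked page or a fallback such as opening the link externally.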

One option here would be to just open the links that have no content in the browser using xdg-open or similar.

Would that suit your use case? I can try to look at parsing, but this opens up a large can of worms: even the Times site in your example requires a Times account to actually get the content.
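For reference, the "open in the browser" option could look roughly like the sketch below. The function names `openerCommand` and `openInBrowser` are made up for illustration, and the per-OS commands are the commonly used conventions (`xdg-open` on Linux, `open` on macOS, `rundll32` on Windows), not something taken from nom's code.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// openerCommand returns the program and arguments conventionally used to
// open a URL with the default handler on the given OS (a runtime.GOOS value).
func openerCommand(goos, url string) (string, []string) {
	switch goos {
	case "darwin":
		return "open", []string{url}
	case "windows":
		return "rundll32", []string{"url.dll,FileProtocolHandler", url}
	default: // Linux and other Unix-like desktops
		return "xdg-open", []string{url}
	}
}

// openInBrowser starts the opener for the current OS without waiting for it
// to exit, so the terminal UI is not blocked.
func openInBrowser(url string) error {
	name, args := openerCommand(runtime.GOOS, url)
	return exec.Command(name, args...).Start()
}

func main() {
	// Only print the command here; calling openInBrowser would launch a browser.
	name, args := openerCommand("linux", "https://example.com/article")
	fmt.Println(name, args)
}
```

Keeping the command selection in a pure function makes it easy to test without actually launching anything.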

guyfedwards avatar Dec 29 '22 23:12 guyfedwards

Hi @guyfedwards

> We could attempt to fetch the content from the URL, but the markdown conversion from a full-page HTML document is likely to be funky.

Yes, it might be a bit of a challenge. You'll want to find a library to strip script, style, and noscript tags, and another to convert html to markdown. I see you're already making use of html-to-markdown.
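As a rough illustration of the stripping step only, here is a sketch that removes `script`, `style`, and `noscript` elements with regular expressions from the Go standard library. This is an assumption-laden shortcut: a real implementation would more likely walk a parsed HTML tree, and the name `stripNonContent` is hypothetical. The HTML-to-markdown conversion itself is left to a library and is not shown.

```go
package main

import (
	"fmt"
	"regexp"
)

// One pattern per tag, because Go's RE2-based regexp has no backreferences.
// (?is) makes matching case-insensitive and lets "." span newlines.
var nonContent = []*regexp.Regexp{
	regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`),
	regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`),
	regexp.MustCompile(`(?is)<noscript\b[^>]*>.*?</noscript\s*>`),
}

// stripNonContent removes script, style, and noscript elements, including
// everything between their opening and closing tags.
func stripNonContent(html string) string {
	for _, re := range nonContent {
		html = re.ReplaceAllString(html, "")
	}
	return html
}

func main() {
	page := `<p>Hello</p><script>track()</script><style>p{color:red}</style><p>World</p>`
	fmt.Println(stripNonContent(page))
}
```

Regex-based stripping can be fooled by unusual markup (for example a literal `</script>` inside a JavaScript string), which is one reason a proper parser is the safer choice.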

> One option here would be to just open the links that have no content in the browser using xdg-open or similar. Would that suit your use case?

Not quite. I'm interested in reading the article in the terminal via the RSS reader without leaving the app.

> ...even the Times site in your example requires a Times account to actually get the content.

I'm able to get the article content via w3m https://www.nytimes.com/live/2022/12/29/world/russia-ukraine-news. Does this work for you?

yonas avatar Dec 30 '22 00:12 yonas

Well, w3m is a complete browser, so of course it displays the article: it renders practically all the elements.

Fetching ONLY an article from a web page is a little tricky. I'm not sure there is any good way of scraping only the article contents from HTML while stripping out headers, navigation, sidebars, footers, etc., as the structure of a web page containing an article isn't standardized. In fact, I can build a whole page with any of those blocks out of styled div elements (not even classes). Not only that, many pages contain skeletons for things like modal windows, or hidden content that becomes visible on click (or other user interaction with the interface), etc. Some will also have the full content visible only with a subscription.

Returning to the question of how to fetch article contents from a webpage: one way I can think of is finding the element with the highest word density (after all the tags are stripped), maybe?
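A toy version of that word-density idea, as a sketch only: the candidate blocks are assumed to have been split out already (for example, one string per top-level `div`), and the scoring rule (word count weighted by the share of the block that is visible text) is a made-up heuristic, not something taken from nom or any library. The names `score` and `densestBlock` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches any HTML tag so it can be replaced with a space.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// score returns the number of words in the block's visible text, weighted by
// the fraction of the raw block that is text rather than markup.
func score(block string) float64 {
	if len(block) == 0 {
		return 0
	}
	words := strings.Fields(tagRe.ReplaceAllString(block, " "))
	textLen := len(strings.Join(words, " "))
	return float64(len(words)) * float64(textLen) / float64(len(block))
}

// densestBlock returns the index of the highest-scoring block,
// or -1 if no block contains any text.
func densestBlock(blocks []string) int {
	best, bestScore := -1, 0.0
	for i, b := range blocks {
		if s := score(b); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	blocks := []string{
		`<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>`,
		`<div><p>A faraway war has transformed this resort town, as new arrivals settle along the coast.</p></div>`,
	}
	fmt.Println(densestBlock(blocks))
}
```

Navigation and footers tend to score low because most of their bytes are markup around a few words, while article paragraphs are mostly text. It will still misfire on pages like the ones described above, where the layout gives no reliable hints.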

All in all, this is quite a hefty task. If something like this is implemented that'd be great. I personally have some feeds like that in my newsboat config, and those are displayed with only a link to the article.

Nemoden avatar Jan 28 '23 07:01 Nemoden

I think that, short-term, opening the link is a sufficient solution. Longer term, we can look at adding HTML parsing capability, but that will be a bit more of a challenge.

guyfedwards avatar Jan 28 '23 10:01 guyfedwards

The circumflex program allows you to read Hacker News articles in the terminal. It accomplishes that by using the Go-Readability package to find the main readable content and the metadata of an HTML page. After that comes some post-processing to turn it into markdown that can be displayed in the terminal (relevant code). My understanding is that this is not 100% reliable, but anecdotally it seems to work for most articles that get posted on HN.

Maybe a similar approach would work here?

apainintheneck avatar Sep 22 '24 00:09 apainintheneck