node-feedsub not fetching new items for several feeds

Open pro-sumer opened this issue 2 years ago • 11 comments

On a Raspberry Pi, I run a Node.js script that uses multiple node-feedsub instances to fetch new items for several RSS feeds every hour.

This is how I create instances for every RSS feed:

...  = new FeedSub(url, { interval: 60, maxHistory: 999, autoStart: true, emitOnStart: true })

When I start the Node.js script, node-feedsub fetches all current articles for each feed. For some feeds, it then also fetches updates every hour. For others, it does not fetch any updates: it reports 0 new items every hour, even though there are new items (I checked the affected RSS feeds manually). If I then restart the script, it fetches all the missing articles at start, but again no updates after that.

Example of an affected feed: https://seths.blog/rss

What can I be doing wrong?

(Is there a limit to the number of instances, since some work and others don't?)

pro-sumer avatar Feb 16 '22 09:02 pro-sumer

my guess is that the <lastBuildDate> is not being updated. if you look at the feed above, the lastBuildDate is older than the newest item's pubDate. maybe I made a mistake in how I'm using lastBuildDate, but I thought it was supposed to reflect whenever there's a change, i.e. a new item, in the feed.

https://cyber.harvard.edu/rss/rss.html#optionalChannelElements

feedsub checks if the feed's last date is the same in order to save on some CPU and bandwidth if the feed is very big. the fix is simple if i'm using lastBuildDate incorrectly.

https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L255-L258
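A minimal sketch of why a stale lastBuildDate can hide new items. This is not feedsub's actual code: `FeedSnapshot`, `DateGatedReader`, and the `useDateGate` flag are hypothetical names made up for illustration, and the feed contents are invented. It only assumes, as described above, that the reader skips a feed whose feed-level date is unchanged since the last poll.

```typescript
// Hypothetical shapes for illustration; these are not feedsub's real types.
interface FeedSnapshot {
  lastBuildDate: string | null;
  itemGuids: string[];
}

class DateGatedReader {
  private lastDate: string | null = null;
  private seen = new Set<string>();

  // Returns the guids that were not seen in earlier polls.
  // With useDateGate enabled, an unchanged feed-level date skips the scan.
  read(feed: FeedSnapshot, useDateGate: boolean): string[] {
    if (
      useDateGate &&
      feed.lastBuildDate !== null &&
      feed.lastBuildDate === this.lastDate
    ) {
      return []; // Feed claims nothing changed, so no items are inspected.
    }
    this.lastDate = feed.lastBuildDate;

    const fresh: string[] = [];
    for (const guid of feed.itemGuids) {
      if (this.seen.has(guid)) continue;
      this.seen.add(guid);
      fresh.push(guid);
    }
    return fresh;
  }
}

// Two polls of an invented feed: a new item "c" appears,
// but the publisher never updates lastBuildDate.
const staleDate = 'Mon, 14 Feb 2022 10:00:00 GMT';
const firstPoll: FeedSnapshot = { lastBuildDate: staleDate, itemGuids: ['a', 'b'] };
const secondPoll: FeedSnapshot = { lastBuildDate: staleDate, itemGuids: ['c', 'a', 'b'] };

const gated = new DateGatedReader();
gated.read(firstPoll, true);
const gatedResult = gated.read(secondPoll, true); // "c" is missed

const ungated = new DateGatedReader();
ungated.read(firstPoll, false);
const ungatedResult = ungated.read(secondPoll, false); // "c" is found

console.log({ gatedResult, ungatedResult });
```

With the date gate on, the second poll reports nothing new, which matches the "0 new items every hour" symptom. A restart clears the remembered date, which would explain why the missing articles show up at start.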

fent avatar Feb 22 '22 02:02 fent

Yes, I noticed that exact issue in this particular feed and I'm currently experimenting with the lastBuildDate check disabled.

I have more feeds that don't update at all after the initial batch; I hope to find time to investigate those as well, now that I have inspected your code (and learned a few things from doing that!).

pro-sumer avatar Feb 22 '22 12:02 pro-sumer

Another feed seems to fail because its pubDate (and lastBuildDate) fields contain a date string (Tue, 22 Feb 2022 15:00:24 CET) that cannot be converted to a JavaScript Date object (I think it would be OK without the trailing CET). This causes getItemDate to return Invalid Date:

https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L274

The sortOrder then becomes NaN instead of a negative/positive number or zero:

https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L278

What can be done about this?

All these problematic feeds have relatively few entries. Would it be possible to make all these "optimisations" in node-feedsub optional and let node-newsemitter take care of only publishing new entries?

pro-sumer avatar Feb 22 '22 21:02 pro-sumer

CET? that's not part of the spec https://www.ietf.org/rfc/rfc822.txt

but maybe feedsub could have a fallback if parsing the date results in NaN
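One possible shape for such a fallback, as a rough sketch rather than anything feedsub ships: try the built-in parser first, and only if that yields NaN, swap a trailing zone abbreviation for a numeric offset and retry. `parseFeedDate` and `ZONE_OFFSETS` are hypothetical names, and the offset table is a small illustrative sample, not a complete list.

```typescript
// A few zone abbreviations that RFC 822 does not define.
// Illustrative only; a real table would need many more entries.
const ZONE_OFFSETS: Record<string, string> = {
  CET: '+0100',
  CEST: '+0200',
  EET: '+0200',
  EEST: '+0300',
};

// Returns milliseconds since the epoch, or null if the string
// cannot be parsed even after the fallback.
function parseFeedDate(raw: string): number | null {
  const text = raw.trim();

  // First attempt: let the JavaScript engine parse the string as-is.
  const direct = Date.parse(text);
  if (!Number.isNaN(direct)) return direct;

  // Fallback: replace a trailing abbreviation with its numeric offset.
  const match = /\s([A-Za-z]{2,5})$/.exec(text);
  if (match) {
    const offset = ZONE_OFFSETS[match[1].toUpperCase()];
    if (offset !== undefined) {
      const retried = Date.parse(text.slice(0, match.index) + ' ' + offset);
      if (!Number.isNaN(retried)) return retried;
    }
  }
  return null;
}

console.log(parseFeedDate('Tue, 22 Feb 2022 15:00:24 CET'));
console.log(parseFeedDate('not a date'));
```

Returning null instead of NaN lets the caller decide explicitly what to do with unparseable dates, rather than letting NaN leak into later arithmetic.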

fent avatar Feb 23 '22 05:02 fent

> All these problematic feeds have relatively few entries. Would it be possible to make all these "optimisations" in node-feedsub optional and let node-newsemitter take care of only publishing new entries?

for the feeds with NaN dates, try increasing the maxHistory. by default it's 10. so without being able to tell what is older than what, it'll compare some random (because sorting by NaN will be random I think?) set of 10 items, and see if any of them are not in the current history.

fent avatar Feb 23 '22 05:02 fent

I'm already using maxHistory 999 instead of 10.

I'll try to investigate a bit further later this week (either by cleaning up feeds before feeding them to feedsub or by forking/modifying feedsub, but I'm not sure yet what would be the best approach).

pro-sumer avatar Feb 23 '22 08:02 pro-sumer

> I'm already using maxHistory 999 instead of 10.

has that fixed the issues with the feeds with invalid dates?

> I'll try to investigate a bit further later this week (either by cleaning up feeds before feeding them to feedsub or by forking/modifying feedsub, but I'm not sure yet what would be the best approach).

i'm willing to remove the check for lastBuildDate if it's being used incorrectly. it's a very small optimization anyway, it wouldn't change behavior.

for the invalid (NaN) dates, using either null or 0 would be better, or the original date string. then at least the sorting comparison between items would be consistent
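A small sketch of the "use 0 for invalid dates" idea, under the assumption that items are sorted newest first by pubDate. `DatedItem`, `sortableTime`, and `newestFirst` are hypothetical names, not feedsub's; the sample items are invented.

```typescript
// Hypothetical item shape for illustration.
interface DatedItem {
  title: string;
  pubDate: string;
}

// Map unparseable dates to 0 so the comparator never returns NaN.
function sortableTime(item: DatedItem): number {
  const time = Date.parse(item.pubDate);
  return Number.isNaN(time) ? 0 : time;
}

// Newest first; items with invalid dates sort to the end.
function newestFirst(items: DatedItem[]): DatedItem[] {
  return [...items].sort((a, b) => sortableTime(b) - sortableTime(a));
}

const sample: DatedItem[] = [
  { title: 'broken', pubDate: 'not a date' },
  { title: 'old', pubDate: 'Mon, 21 Feb 2022 10:00:00 GMT' },
  { title: 'new', pubDate: 'Tue, 22 Feb 2022 10:00:00 GMT' },
];

const sortedTitles = newestFirst(sample).map((item) => item.title);
console.log(sortedTitles);
```

The trade-off is that every item with an invalid date is treated as the oldest possible item, so a feed made up entirely of such items keeps its original order instead of being sorted by time.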

fent avatar Feb 23 '22 08:02 fent

No, unfortunately 999 did not help for the invalid dates.

pro-sumer avatar Feb 23 '22 13:02 pro-sumer

I'm currently experimenting with htmlparser2 (and feed), where I only check the guid of an item to see whether it is new or not. Skipping all the nice optimisations from feedsub seems to work better for these problematic feeds (so far).
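Roughly, the guid-only check could look like the sketch below. It leaves out the parsing step entirely (no htmlparser2 calls are shown) and assumes the items have already been extracted with a guid each. `GuidItem` and `SeenGuids` are hypothetical names, and the sample items are invented.

```typescript
// Hypothetical item shape; assumes every item carries a usable guid.
interface GuidItem {
  guid: string;
  title: string;
}

class SeenGuids {
  private seen = new Set<string>();

  // Returns only items whose guid has not appeared in any earlier call,
  // ignoring dates and feed-level metadata completely.
  filterNew(items: GuidItem[]): GuidItem[] {
    const fresh: GuidItem[] = [];
    for (const item of items) {
      if (this.seen.has(item.guid)) continue;
      this.seen.add(item.guid);
      fresh.push(item);
    }
    return fresh;
  }
}

const tracker = new SeenGuids();

// First poll: everything is new.
const firstBatch = tracker.filterNew([
  { guid: 'a', title: 'First post' },
  { guid: 'b', title: 'Second post' },
]);

// Second poll: only the item with the unseen guid is reported.
const secondBatch = tracker.filterNew([
  { guid: 'c', title: 'Third post' },
  { guid: 'a', title: 'First post' },
  { guid: 'b', title: 'Second post' },
]);

console.log(firstBatch.length, secondBatch.map((item) => item.guid));
```

This trades CPU and memory for robustness: every item is checked on every poll and the set grows without bound, which is probably fine for feeds with relatively few entries but would need a cap for large ones.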

Not blaming feedsub though, as these feeds are indeed invalid.

(and still using feedsub in other projects that luckily only work with valid feeds)

pro-sumer avatar Feb 25 '22 13:02 pro-sumer

when i google "rss lastBuildDate" i get a bunch of results about rss libraries implementing this incorrectly as per the rss spec, including wordpress. i think it's safe to ignore this field

fent avatar Feb 28 '22 00:02 fent

Hi guys, I've created a pull request to make this library more customizable. I've got exactly the same problem as you described here, and for my use case, I needed to get a value from the item by a unique key.

@fent

This change is backward compatible. I'd appreciate it if someone could review, merge, and release it: https://github.com/fent/node-feedsub/pull/65

ghost avatar Sep 19 '22 15:09 ghost