Autopurging does not delete unread news items

Open fabolhak opened this issue 7 years ago • 46 comments

Explain the Problem

Old feed items are not purged automatically. Since I don't mark all items as read, the database grows over time and becomes very large.

Steps to Reproduce

Clone the Docker image from the official repo, install the News app, and install/configure cron.

  1. Add a news feed with a large number of unread news items (e.g. http://usa.chinadaily.com.cn/china_rss.xml)
  2. Set the autopurge limit ("Maximum read count per feed") to a low value (e.g. 10)
  3. Wait for the next cron update (or trigger one manually)

Expected Behavior

The number of unread news items should be 10 and should not increase over time.

Actual Behavior

The number of unread news items is higher than 10 and increases when new items are being published.

System Information

  • News app version: 10.1.0
  • Nextcloud version: 11.0.1
  • PHP Version: 5.6.30
  • PHP Memory Limit: 512.0 MB
  • PHP Max Execution Time: 3600
  • PHP Upload max size: 511.0 MB
  • Database and version: sqlite3 (3.8.10)

Contents of nextcloud/data/nextcloud.log

nothing relevant

fabolhak avatar Mar 11 '17 23:03 fabolhak

Similar to this: https://github.com/owncloud/news/issues/1024

fabolhak avatar Mar 11 '17 23:03 fabolhak

The number is added to the maximum number of entries the feed has served at some point, to prevent read items from reappearing as unread.

BernhardPosselt avatar Mar 12 '17 00:03 BernhardPosselt

Hi, thanks for the fast reply. :) I'm not sure I understand the functionality of "Maximum read count per feed" correctly.

So basically you are saying: when a feed always serves its 100 most recent items (the number of feed items is always 100, but the items change over time) and the "Maximum read count per feed" is 10, then these two numbers add up? So the Nextcloud News app should never hold more than 110 items for this feed?

fabolhak avatar Mar 12 '17 08:03 fabolhak

Exactly :)

You can view the number in the oc_news_feeds table (articles_per_update)
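
For example, with an illustrative query like this (assuming the default table prefix, and 10 as your configured read count):

    SELECT id, url, articles_per_update,
           articles_per_update + 10 AS expected_item_cap
    FROM oc_news_feeds;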

The logic itself is covered by an integration test so I'm pretty sure that it works (though there's still a chance it doesn't xD)

BernhardPosselt avatar Mar 12 '17 09:03 BernhardPosselt

See https://github.com/nextcloud/news/blob/master/lib/Db/ItemMapper.php#L278

BernhardPosselt avatar Mar 12 '17 09:03 BernhardPosselt

OK, cool. I also investigated the code a little bit; however, my SQL and PHP are a bit rusty. 😅

In my current ownCloud instance I have this problem too. For example, the China Daily feed has an "articles_per_update" of 300 and the maximum read count is configured to 60. However, when I check the database I find over 3800 items with China Daily's feed id.

The ownCloud News app is a bit outdated, though. I also pulled the newest Nextcloud Docker image and installed the News app (see above). If the problem still exists I will report back in a few days. :)

fabolhak avatar Mar 12 '17 14:03 fabolhak

Hey, so the issue still exists in the Nextcloud version of the News app. I subscribed to the China Daily feed, which has an "articles_per_update" of 300. My maximum read count is configured to 10. Today I had more than 310 unread items in the News app.

I also had a look at the code. What I don't understand is this line: https://github.com/nextcloud/news/blob/master/lib/Db/ItemMapper.php#L275

The $status seems to be the set of statuses that is excluded from the autopurge process. However, because of the | StatusFlag::UNREAD, unread items are excluded as well. This causes the autopurge process to skip unread items, and we get the behavior described above (the database becomes very large over time).

When I comment out the latter part: $status = StatusFlag::STARRED /*| StatusFlag::UNREAD*/;

the autopurge process seems to work, and the number of China Daily items stays constant at 310.

I'm not sure whether this is intended or not, but with | StatusFlag::UNREAD in place the autopurge process only deletes read items beyond the "articles_per_update" + "maximum read count" limit.
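
In SQL terms, the exclusion boils down to something like this (a sketch only, not the app's actual query; the flag values 2 for unread and 4 for starred, and the :placeholders, are illustrative):

    -- With | StatusFlag::UNREAD in the mask, only items that are neither
    -- starred nor unread (i.e. read items) can be purged:
    DELETE FROM oc_news_items
    WHERE feed_id = :feedId
      AND (status & 6) = 0   -- 6 = STARRED (4) | UNREAD (2)
      AND id < :oldestIdToKeep;

    -- With UNREAD commented out, unread items become purgeable too:
    DELETE FROM oc_news_items
    WHERE feed_id = :feedId
      AND (status & 4) = 0   -- only starred items are protected
      AND id < :oldestIdToKeep;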

fabolhak avatar Mar 13 '17 10:03 fabolhak

I have several feeds that update often, so they show "999+" after some time. This puts heavy load on the database, because the old feed items don't get purged. I logged this using the MySQL slow query log, as mentioned in the Nextcloud developer manual; here is its output:

# Time: 2017-06-28T12:40:39.210304Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2768
# Query_time: 229.221434  Lock_time: 0.000387 Rows_sent: 0  Rows_examined: 54006
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.211512Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2765
# Query_time: 229.177449  Lock_time: 0.000232 Rows_sent: 0  Rows_examined: 54006
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.214785Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2756
# Query_time: 280.522846  Lock_time: 0.000236 Rows_sent: 0  Rows_examined: 64214
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.219031Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2959
# Query_time: 66.927684  Lock_time: 0.000692 Rows_sent: 0  Rows_examined: 11211
SET timestamp=1498653639;
SELECT COUNT(*) AS size FROM `oc_news_items` `items` JOIN `oc_news_feeds` `feeds` ON `feeds`.`id` = `items`.`feed_id` AND `feeds`.`deleted_at` = 0 AND `feeds`.`user_id` = 'alfred' AND ((`items`.`status` & 4) = 4)LEFT OUTER JOIN `oc_news_folders` `folders` ON `folders`.`id` = `feeds`.`folder_id` WHERE `feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0;
# Time: 2017-06-28T12:40:39.247040Z
# User@Host: owncloud[owncloud] @ localhost []  Id:  2993
# Query_time: 411.752023  Lock_time: 0.000392 Rows_sent: 0  Rows_examined: 64217
SET timestamp=1498653639;
SELECT `feeds`.*, COUNT(`items`.`id`) AS `unread_count` FROM `oc_news_feeds` `feeds` LEFT OUTER JOIN `oc_news_folders` `folders` ON `feeds`.`folder_id` = `folders`.`id` LEFT JOIN `oc_news_items` `items` ON `feeds`.`id` = `items`.`feed_id` AND (`items`.`status` & 2) = 2 WHERE `feeds`.`user_id` = 'alfred' AND (`feeds`.`folder_id` = 0 OR `folders`.`deleted_at` = 0)AND `feeds`.`deleted_at` = 0 GROUP BY `feeds`.`id`;

Basically the only app bogging down the database is the News app; no other queries were logged. Sometimes this results in load so heavy that the feeds are not updated anymore. Only resetting the News app and importing a backup *.opml file solves the problem.
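
For reference, I enabled the slow query log with MySQL settings roughly like these (path and threshold are illustrative):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
    SET GLOBAL long_query_time = 1;  -- log queries taking longer than 1 second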

e-alfred avatar Jun 29 '17 11:06 e-alfred

This might be related to #191, the queries are slow because of the way status is handled.

So perhaps once that is fixed, this issue becomes a non-issue.

I'm trying to wrap my head around what is intended with the purging code here, but that needs a bit more study. It does indeed look like there's a bug in this bit of code, in that it doesn't delete unread items, which then accumulate quickly on high-traffic feeds.

sonologic avatar Jun 30 '17 20:06 sonologic

Diving into it a bit deeper, it appears to me that unread items are not purged by design, the assumption being that we should never throw away items the user hasn't seen yet. The "Maximum read count per feed" setting mentioned above is described in the News app's admin settings as "Defines the maximum amount of articles that can be read per feed which won't be deleted by the cleanup job".

There is, as far as I can tell, no mechanism that purges unread items.

Given the experiences reported in this issue, we should probably introduce a mechanism to purge unread items once they exceed a certain (configurable?) threshold. I would propose a new configuration option 'Maximum items per feed': if a feed has more than that number of items, the excess items are purged.

The default for 'Maximum items per feed' should be 'disabled', so that unread items are never deleted.

Example: 'Maximum items per feed' is set to 1000 and a feed has 1500 items. Then the oldest 500 items (determined by id) are purged, regardless of whether they are read or unread, with the exception of starred items of course.
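
A rough sketch of what such a purge could look like (hypothetical, since the option does not exist yet; 4 is assumed to be the starred status bit and :feedId a bound parameter):

    DELETE FROM oc_news_items
    WHERE feed_id = :feedId
      AND (status & 4) != 4          -- never purge starred items
      AND id NOT IN (
          SELECT id FROM (
              SELECT id FROM oc_news_items
              WHERE feed_id = :feedId
              ORDER BY id DESC       -- newest first, by id
              LIMIT 1000             -- 'Maximum items per feed'
          ) AS newest
      );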

Just out of curiosity, how is anyone managing to keep up with feeds that have that many articles?

sonologic avatar Jun 30 '17 20:06 sonologic

@sonologic Well, I have a bunch of high-volume forum feeds which I like to look through from time to time. The News app is a good place to keep them together, so I only have to go to one location. Also, being part of Nextcloud, I don't have to run another newsfeed application, which is a big bonus.

A limit on the number of items would be much appreciated, because that number is what puts a high load on the database, as I showed above. It gets out of control after some time, and the only way to get rid of the problem is to reset the News app and reimport the *.opml file.

Also, for people with less powerful machines this is an even bigger problem, I suppose. The News app shouldn't be a power hog just because a few feeds produce a high volume of new items in a short time.

It doesn't have to be enabled by default, though: just deactivate it by default, let the user decide what to do, and put it in the documentation.

e-alfred avatar Jul 04 '17 09:07 e-alfred

I changed this line to $status = StatusFlag::STARRED;

This works perfectly for me. With this fix the number of feed items is limited to "articles_per_update"+"maximum read count".

fabolhak avatar Jul 04 '17 12:07 fabolhak

@fabolhak while that may be perfect for you, for a lot of other people it is far from perfect. Deleting unread items is unexpected behaviour. It must be a conscious opt-in, not the default behaviour.

Also, probably with #191 fixed, the load on the database will no longer manifest as it does now.

sonologic avatar Jul 04 '17 16:07 sonologic

Any progress on this? I too would like to allow deleting items that are not marked as read. Today I had trouble updating my Nextcloud instance because my SQLite DB had grown to more than 600 MB, my VPS was low on space, and the DB couldn't be converted. I ended up doing a 'delete from oc_news_items' and a vacuum, which got the DB down to 33 MB. Perhaps an option to keep X unread items, or to delete items older than X days, would do? I use the News app to skim through news feeds and wouldn't ever want to go back more than a few weeks. I guess the root cause is that I use my own app on Sailfish to sync from my Nextcloud instance and never set the read status; perhaps I should set the read status of all items after I've downloaded them. But still, the DB shouldn't grow to an unmanageable size.
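
For anyone in the same situation, the commands I ran against the SQLite DB were essentially these (destructive: they drop every stored item):

    DELETE FROM oc_news_items;
    VACUUM;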

piggz avatar Aug 04 '18 09:08 piggz

In my opinion there is a misunderstanding in this function. Although it is named deleteReadOlderThanThreshold, which states that "it deletes read items older than the threshold" (and that is what this function actually does), I believe the author wanted to write deleteUnreadOlderThanThreshold, which is what the phpdoc explains:

Delete all items for feeds that have over $threshold unread and not starred items

A simple change from

$params = [false, false, $threshold];

to

$params = [true, false, $threshold];

should do the trick.

Update

More convenient, of course, would be a new function deleteUnreadOlderThanThreshold with its own configurable threshold number (or zero if anybody wants to keep the old behavior).

I think I will do a PR.

smoebody avatar Sep 20 '18 08:09 smoebody

You do not want to delete unread items because you didn't read them yet. Works as intended.

BernhardPosselt avatar Sep 25 '18 19:09 BernhardPosselt

It may work as intended, but this could be considered a feature request. There is definitely a use case for deleting old news, read or not, so as not to exceed disk space, if the user wishes that.

piggz avatar Sep 25 '18 20:09 piggz

I know the app has been pretty much unmaintained for quite a while, but this ever-growing DB is no fun. At the moment my DB (lots of feeds and users, though) holds 28 GB of feed items alone, which is 95% of the entire Nextcloud database, with 9709932 unread vs. 702188 read items. It's clearly the users' fault, as people just add lots of feeds and never read them. I think I will just mark everything read that is older than a month or two. Will the read items then be removed from the DB? Has anyone thought of a way to get rid of the (pretty much) waste at this point? I once simply removed items older than a certain date, but that introduced a lot of issues with items reappearing etc. Is there a way to change articles_per_update? For instance, if I were to run a query that changes every articles_per_update above (say) 100 to 50, would that get overwritten on the next feed update?

Seeing how much of the DB is used by unread items, I want to bring some order in and purge as many items as possible to prevent the enormous DB growth, without being forced to disable the News app altogether.
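
(The query I have in mind would be something like the one below; whether the updater overwrites the value on the next run is exactly my question.)

    UPDATE oc_news_feeds
    SET articles_per_update = 50
    WHERE articles_per_update > 100;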

muppeth avatar Jan 31 '19 01:01 muppeth

Once I thought about normalizing feeds and feed items:

  • you would have one single list of feeds and items stored in your database
  • every user would be subscribed to a feed by a relation in the database - User x Feed
  • users would have relations to feed items for the different statuses like read/starred/deleted (probably we could delete the relation directly instead of using a flag)
  • when a user subscribes to an existing feed -> create the user-feed relation + create the relations for its feed items
  • when the user subscribes to a new feed -> create the feed and set up the relations - basically what we currently have
  • feed items would be deleted once every subscribed user has read them
  • feeds would be deleted when no one is subscribed anymore

We would get some benefits:

  • every feed entry is retrieved once, not once per subscription - saves CPU, RAM, energy
  • every feed + item is stored once and just linked - smaller database size

Just wanted to put my idea in, because in this case it could help reduce the DB size. Let me know what you think and whether I should create a separate issue for it, or if it's just random nonsense.

danopz avatar Mar 18 '19 23:03 danopz

Right, it's a trade-off.

Why I decided against it back in the day was simplicity: you make your data model a lot harder to use and change for a theoretical gain (assumption: people share the same feeds), especially when you can have many different versions of a feed.

Your assumption requires feed ids to be unique, which doesn't hold up in practice. You can get a different version of a feed based on authentication parameters, full-text feed settings, IP ranges, client headers, etc. You also need to anticipate future changes.

Even migrating this stuff could take days for certain installations (e.g. 1000+ users).

BernhardPosselt avatar Mar 18 '19 23:03 BernhardPosselt

You do not want to delete unread items because you didn't read them yet. Works as intended.

No it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items. Every comment in the code indicates that: the talk is always of unread items with regard to autopurging. In particular, the documentation for the autopurge threshold says:

Config.php:27

    private $autoPurgeCount;  // number of allowed unread articles per feed

itemService.php:250

    /**
     * This method deletes all unread feeds that are not starred and over the
     * count of $this->autoPurgeCount starting by the oldest. This is to clean
     * up the database so that old entries don't spam your db. As criteria for
     * old, the id is taken
     */
    public function autoPurgeOld()

itemMapper.php:335

    /**
     * Delete all items for feeds that have over $threshold unread and not
     * starred items
     *
     * @param int $threshold the number of items that should be deleted
     */
    public function deleteReadOlderThanThreshold($threshold)

I made a PR #506 which removes the unread condition from the queries, and it is working for me. If anybody should doubt it, just increase the threshold from 200 to - let's say - 1000, and you will still save more items than you ever might want - without keeping a table full of garbage.

smoebody avatar May 12 '19 16:05 smoebody

No it doesn't. It makes absolutely no sense to keep read items, so the threshold for autopurging was always meant to be the threshold for unread items.

Autopurge works in the following way (even if the comments are wonky; sorry, but I implemented this stuff):

  • Per feed: keep a maximum number (the one you configure) of read items, ordered by latest added
  • Delete all articles older than the stuff you marked as read which is not starred (basically stuff that you consumed but did not want to keep)

and you will still save more items than you ever might want - without keeping a table full of garbage.

The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically.

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.
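
Roughly, as SQL, the purge does something like this (a sketch only, not the actual mapper query; 2 and 4 are assumed to be the unread and starred status bits, :feedId and :threshold are bound parameters):

    -- keep the newest :threshold read items per feed and purge older
    -- read, unstarred ones; unread items are never touched
    DELETE FROM oc_news_items
    WHERE feed_id = :feedId
      AND (status & 2) != 2          -- read items only
      AND (status & 4) != 4          -- never starred items
      AND id < (
          SELECT MIN(id) FROM (
              SELECT id FROM oc_news_items
              WHERE feed_id = :feedId AND (status & 2) != 2
              ORDER BY id DESC
              LIMIT :threshold
          ) AS newest_read
      );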

BernhardPosselt avatar May 12 '19 19:05 BernhardPosselt

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.

Yes, but that is still a viable benefit. I have had cases of my Nextcloud DB failing to upgrade because it had grown so large that I ran out of space on my hosting when the DB upgrade occurred, and the size was taken up by unread articles. It's easy to say 'mark them as read', but that can mean endlessly scrolling through old stuff I'm never going to read. I'm sure a single option of 'also delete unread items', defaulting to off, would be good; in the meantime, I may try the version from the PR.

piggz avatar May 12 '19 19:05 piggz

@piggz did you see the mark all as read button? And you can trigger the cleanup (aka the cron jobs) manually using the same php command that you use to update stuff or just wait until it runs as configured.

BernhardPosselt avatar May 12 '19 19:05 BernhardPosselt

@BernhardPosselt I hadn't, although I don't use the web app; I use my own mobile app. I presume there is an API call I can make to mark all as read?

piggz avatar May 12 '19 20:05 piggz

Yep

BernhardPosselt avatar May 12 '19 22:05 BernhardPosselt

@BernhardPosselt

Autopurge works in the following way (even if the comments are wonky; sorry, but I implemented this stuff):

  • Per feed: keep a maximum number (the one you configure) of read items, ordered by latest added
  • Delete all articles older than the stuff you marked as read which is not starred (basically stuff that you consumed but did not want to keep)

Yes, this is exactly how it works. But tell me: what is the use of keeping read items after all? If you want to keep them, you should star them.

The "garbage" that you mean is basically stuff that you did not read yet. That's why you have a feed reader. If you (don't) want to read it, mark it as read and it will be cleaned up for you automatically.

There isn't really a benefit in deleting unread items since data usage is minimal and the original use case for this feature is to not accumulate more and more data in your table that you are not interested in any more.

Marking items as read that I do not want to read is a workaround, not a solution. Right now, items I do not want to read persist in the database and create garbage. This is a fact. Many people here complain about an ever-growing table because of this issue, including me. Several months have passed since my last comment, which is the time it took my database to grow so much that I came back to this issue.

did you see the mark all as read button? And you can trigger the cleanup (aka the cron jobs) manually using the same php command that you use to update stuff or just wait until it runs as configured.

This is ridiculous. You suggest that I instruct all users of my Nextcloud to mark all items as read, whether they read them or not, in order to keep my database clean. Mark all as read is useless as well, since it marks all items as read, including items I might actually want to read because they are genuinely new.

You should admit that at some point items are no longer news and should be cleaned up automatically in order to keep the database sane. My PR does exactly that, and it simplifies the cleanup process as well. Please consider it.

smoebody avatar May 13 '19 07:05 smoebody

You should admit that at some point items are no longer news and should be cleaned up automatically in order to keep the database sane. My PR does exactly that, and it simplifies the cleanup process as well. Please consider it.

RSS is not limited to news per se, so your pull request would nuke stuff like podcasts and other things.

I think I understand your use case now: you basically want an incomplete stream of fresh entries that is limited to something like: "show me what's relevant for today"

That behavior is probably problematic on a global level: not every feed is strictly out of date so if at all this would have to be configured on per feed level IMHO. And then you probably want to keep stuff by date.

This is ridiculous. You suggest that I instruct all users of my Nextcloud to mark all items as read, whether they read them or not, in order to keep my database clean. Mark all as read is useless as well, since it marks all items as read, including items I might actually want to read because they are genuinely new.

That is an entirely different issue: what about users importing tons of contacts and files? In general you need to talk to your users and trust them not to abuse it. On bigger installations this is probably an issue, but there you have a lot more resources available. I personally know of a setup at a Finnish university which works just fine.

What could be done, though, is to build in quota support: letting users allocate only a specific maximum amount of MBs in your DB, and otherwise refusing to update feeds. You can also expose that using the feed warning APIs already in place. That way you don't have data loss for older items, and a user should be able to fix it quickly.
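
Measuring usage per user could be as simple as this (an illustrative query; the exact column set may differ):

    SELECT feeds.user_id,
           SUM(LENGTH(items.title) + LENGTH(items.body)) AS approx_bytes
    FROM oc_news_items items
    JOIN oc_news_feeds feeds ON feeds.id = items.feed_id
    GROUP BY feeds.user_id;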

BernhardPosselt avatar May 13 '19 08:05 BernhardPosselt

@BernhardPosselt

RSS is not limited to news per se, so your pull request would nuke stuff like podcasts and other things.

How is that? Can you explain, please?

I think I understand your use case now: you basically want an incomplete stream of fresh entries that is limited to something like: "show me what's relevant for today"

That behavior is probably problematic on a global level: not every feed is strictly out of date so if at all this would have to be configured on per feed level IMHO. And then you probably want to keep stuff by date.

What do you mean by "not every feed is strictly out of date"? As I said: set the items to keep to 1000, which will hold 1000 items per feed. I cannot imagine a use case where anybody would want to read the 1001st item.

That is an entirely different issue: what about users importing tons of contacts and files? In general you need to talk to your users and trust them not to abuse it.

Abuse and misuse are two different things. A user has to act deliberately to import files and contacts; that is done on purpose, and the user can be held responsible for it.

A user importing 10 feeds, each generating 100 items per day (and such feeds exist), without reading them cannot be held responsible for that, even though it fills the Nextcloud database with garbage.

What could be done, though, is to build in quota support: letting users allocate only a specific maximum amount of MBs in your DB, and otherwise refusing to update feeds. You can also expose that using the feed warning APIs already in place. That way you don't have data loss for older items, and a user should be able to fix it quickly.

I think you overcomplicate things. Also, you put the responsibility back on the user, which is a bad idea IMO. The user will not understand.

I suggest keeping the items in a simple FIFO of definable length (this is already there). If items disappear too soon, then increase the length; but making a distinction between read and unread items regarding whether they should be kept leads to confusion and misunderstanding (which is already the case). If this breaks how people are using the extension, please give an example. I cannot think of one.

Cheers

smoebody avatar May 13 '19 09:05 smoebody

What do you mean by "not every feed is strictly out of date"? As I said: set the items to keep to 1000, which will hold 1000 items per feed. I cannot imagine a use case where anybody would want to read the 1001st item.

Think of feeds containing images, videos, open issues in your customer management system (like GitHub, for instance, where read means: done), error messages from logs, torrents, movie episodes and their availability, music, etc. If you delete those, you basically lose data. Therefore you also cannot use a FIFO queue, because you would be deleting data that is not yet marked for deletion.

As an analogy: think of a bookmarks application that just deletes your old bookmarks because you went over a global limit.

If items disappear too soon, then increase the length

You don't really know what a safe number of unread articles is. So people will increase it to something very big, which in turn will keep much more garbage around than the current implementation.

I think you overcomplicate things. Also, you put the responsibility back on the user, which is a bad idea IMO. The user will not understand.

You are arguing for deleting user data rather than explaining to users that they can only keep a limited amount and that their data allowance is full. Lots of software, like Dropbox and the Files app, does exactly that, and it works just fine.

BernhardPosselt avatar May 13 '19 11:05 BernhardPosselt