
Use task names to reduce the risk of duplicate update-feed tasks.

mblakele opened this issue Dec 15 '14 • 7 comments

The task name is based on the feed id (URL). Setting this should reduce the risk of duplicate update-feed tasks (#231) because the queue shouldn't accept tasks with duplicate names. According to the docs this isn't an iron guarantee, but should work under most circumstances.

I've been running this change self-hosted for a day or so. I'm not sure whether or not it's caught any duplicates, but it doesn't seem to cause any new problems.

mblakele avatar Dec 15 '14 01:12 mblakele

The docs say:

If a push task is created successfully, it will eventually be deleted (at most seven days after the task successfully executes). Once deleted, its name can be reused.

This suggests it can take up to seven days before a task name becomes reusable. If that's true, I don't think we can use this method.

maddyblue avatar Dec 16 '14 07:12 maddyblue

Re the 7 days: that's why the second commit adds NextUpdate to the task name.

mblakele avatar Dec 16 '14 08:12 mblakele

If we add in the feed update time to the task name, how does this change avoid the duplicate update problems? If the update orders described in #231 happen, would this change prevent double updating?

maddyblue avatar Dec 16 '14 08:12 maddyblue

Maybe I'm misunderstanding #231 or the update process? Here's the problem I thought I was solving:

  1. Some feed F has NextUpdate == t0
  2. UpdateFeeds selects feeds where NextUpdate <= now at t1, including feed F with NextUpdate t0.
  3. UpdateFeeds queues a task.
  4. The task queue is long or slow, or UpdateFeed takes a long time to fetch the feed URL. Meanwhile NextUpdate is still t0.
  5. UpdateFeeds runs at t2, selecting F with NextUpdate t0 and queuing it for update — again.
  6. The queue contains duplicate tasks for feed F.

With task names using F.NextUpdate and F.StringID(), the value for NextUpdate is the same in steps 2 & 5. In step 5 the task name collision prevents the duplicate task from queuing, avoiding duplicate work.

In practice I seem to see something like this in my self-hosted deployment. Here's an example of a duplicate task caught by this code, as logged; I suspect this one was caused by a combination of slow datastore queries and slow URL fetches. First, the successful queuing:

18:13:42.745
queuing feed 2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml

Two minutes later, here's a duplicate task with the same NextUpdate value for the same feed URL:

18:15:43.281 error for task &{/tasks/update-feed [102 101 101 100 61 104 116 116 112 37 51 65 37 50 70 37 50 70 119 105 107 105 46 120 101 110 46 112 114 103 109 114 46 99 111 109 37 50 70 120 101 110 111 112 104 105 108 105 97 37 50 70 97 116 111 109 46 120 109 108] map[Content-Type:[application/x-www-form-urlencoded]] POST
     2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml 0 0001-01-01 00:00:00 +0000 UTC 0 <nil>}: taskqueue: task has already been added

But the update finally happened and there were no further errors:

18:15:43.177 update feed http://wiki.xen.prgmr.com/xenophilia/atom.xml
18:15:44.759 hasUpdate: true, isFeedUpdated: false, storyDate: 2014-12-15 01:21:38.37879 +0000 UTC, stories: 15
18:15:44.765 0 update stories
18:15:44.765 putting 1 entities
18:15:45.133 next update scheduled for 30m16s from now

At 18:47:47.421 the same feed was updated again, this time without any duplicate tasks:

18:47:47.305 queuing feed 2014-12-15T02-46-01Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml

mblakele avatar Dec 16 '14 17:12 mblakele

I thought of another situation worth mentioning: what if we queue an update-feed task but fetchFeed fails? Maybe the server is temporarily down or unreachable, or it's returning error pages instead of valid RSS or Atom. Will we be able to try again with a new task, or is this feed stuck forever?

When I wrote the task-name code I wasn't thinking about that. But looking at it now, I think it's ok because the existing feedError handler in UpdateFeed takes care of it. If anything goes wrong with the fetch or parse, NextUpdate becomes time.Now().Add(time.Hour * time.Duration(v)), where v is calculated from the number of errors previously observed. So when UpdateFeeds tries to queue a new task for this problem feed, it'll have an updated NextUpdate value and so the task name will be ok.

mblakele avatar Dec 16 '14 21:12 mblakele

Ok, I've had a chance to look over this fully now. I don't particularly care about human-readable task names, especially since the feed URL is already embedded in the task data. I'd prefer to use base64.URLEncoding for the names because I think it's safer and better tested.
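For reference, the base64.URLEncoding approach might look like this sketch (function and parameter names are made up). The URL-safe alphabet is letters, digits, `-`, and `_`, all legal in task names; the `=` padding is not, so it's stripped here, which is fine because the name only needs to be unique, never decoded.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// taskNameB64 encodes the feed ID with the URL-safe base64 alphabet and
// strips the '=' padding, then appends a time-derived suffix.
func taskNameB64(feedID, timeSuffix string) string {
	enc := base64.URLEncoding.EncodeToString([]byte(feedID))
	return strings.TrimRight(enc, "=") + "_" + timeSuffix
}

func main() {
	fmt.Println(taskNameB64("http://wiki.xen.prgmr.com/xenophilia/atom.xml",
		"2014-12-15T01-55-10"))
}
```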

The main part of the solution here is appending the feed's next update time to the task name. But that turns a keys-only query (a small datastore op, which is free) into a full read operation (not free). I think we can keep the keys-only query and still achieve this. Obviously, that means we can no longer depend on the feed's next update time; I believe we should instead use UpdateMin from settings.go, like so: time.Now().Truncate(UpdateMin). Since we never want to update more than once per UpdateMin, this will work most of the time. It will still fail when that boundary is crossed, but it's much better than what we have now, at no additional cost. Because of this boundary problem, though, I'm not convinced this is a good idea. Will ponder.

maddyblue avatar Dec 23 '14 09:12 maddyblue

Looking at the log messages I posted above, would using UpdateMin have helped? Also consider a scenario where many feeds are responding slowly and so the queue is backed up.

For me the read operation is worth the cost: it's staying well under 20% of the free quota. It may be a question of which is more expensive: the datastore reads or the extra tasks. But it might also be worth considering something based on TCP/IP timeouts and the settings from cron.yaml and queue.yaml. Another option might be to keep task metadata in memcache, if that's cheaper than using the datastore.

mblakele avatar Dec 23 '14 16:12 mblakele