goread
Use task names to reduce the risk of duplicate update-feed tasks.
The task name is based on the feed id (URL). Setting a name should reduce the risk of duplicate update-feed tasks (#231), because the queue shouldn't accept tasks with duplicate names. According to the docs this isn't an iron guarantee, but it should work under most circumstances.
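A minimal sketch of the mechanism, assuming the google.golang.org/appengine/taskqueue package; the function, queue name, and how the name is derived are illustrative, not goread's exact code:

```go
import (
	"context"
	"net/url"

	"google.golang.org/appengine/taskqueue"
)

// queueFeedUpdate queues an update-feed task with an explicit name.
// Named tasks are de-duplicated by the queue: adding a second task with
// the same name fails instead of creating a duplicate.
func queueFeedUpdate(c context.Context, name, feedID string) error {
	t := taskqueue.NewPOSTTask("/tasks/update-feed", url.Values{"feed": {feedID}})
	t.Name = name // derived from the feed ID; names are limited to [a-zA-Z0-9_-]
	if _, err := taskqueue.Add(c, t, "update-feed"); err != nil {
		if err == taskqueue.ErrTaskAlreadyAdded {
			return nil // already queued under this name; skip the duplicate
		}
		return err
	}
	return nil
}
```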
I've been running this change self-hosted for a day or so. I'm not sure whether or not it's caught any duplicates, but it doesn't seem to cause any new problems.
The docs say:
If a push task is created successfully, it will eventually be deleted (at most seven days after the task successfully executes). Once deleted, its name can be reused.
This suggests it can take up to seven days before a task name becomes reusable. If that's true, I don't think we can use this method.
Re the 7 days: hence the second commit, which adds NextUpdate to the task name.
If we add the feed update time to the task name, how does this change avoid the duplicate update problem? If the sequence of events described in #231 happens, would this change prevent double updating?
Maybe I'm misunderstanding #231 or the update process? Here's the problem I thought I was solving:
1. Some feed F has NextUpdate == t0.
2. UpdateFeeds selects feeds where NextUpdate <= now at t1, including feed F with NextUpdate t0.
3. UpdateFeeds queues a task.
4. The task queue is long or slow, or UpdateFeed takes a long time to fetch the feed URL. Meanwhile, NextUpdate is still t0.
5. UpdateFeeds runs at t2, again selecting F with NextUpdate t0 and queuing it for update.
6. The queue now contains duplicate tasks for feed F.

With task names built from F.NextUpdate and F.StringID(), the value of NextUpdate is the same in steps 2 and 5, so in step 5 the task name collision prevents the duplicate task from queuing, avoiding duplicate work.
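As a sketch of how such a name might be composed (the timestamp layout and the hex escaping are assumptions inferred from the log lines below, not goread's exact code):

```go
import (
	"fmt"
	"strings"
	"time"
)

// taskName builds a queue-safe name from the feed's scheduled update time
// and its string ID (the feed URL). Task names may only contain letters,
// digits, hyphens, and underscores, so everything else is hex-escaped.
func taskName(nextUpdate time.Time, feedID string) string {
	// Colons aren't allowed in task names, so swap them for hyphens.
	ts := strings.ReplaceAll(nextUpdate.UTC().Format(time.RFC3339), ":", "-")
	var b strings.Builder
	for _, r := range feedID {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z',
			r >= '0' && r <= '9', r == '-':
			b.WriteRune(r)
		default:
			fmt.Fprintf(&b, "_%02X", r) // e.g. ':' -> _3A, '/' -> _2F
		}
	}
	return ts + "_" + b.String()
}
```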
In practice I see something like this in my self-hosted deployment. Here's an example of a duplicate task caught by this code, as logged; I suspect this one was a combination of slow datastore queries and slow URL fetches. First, the successful queuing:
18:13:42.745 queuing feed 2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml
Two minutes later, here's a duplicate task with the same value of NextUpdate for the same feed URL:
18:15:43.281 error for task &{/tasks/update-feed [102 101 101 100 61 104 116 116 112 37 51 65 37 50 70 37 50 70 119 105 107 105 46 120 101 110 46 112 114 103 109 114 46 99 111 109 37 50 70 120 101 110 111 112 104 105 108 105 97 37 50 70 97 116 111 109 46 120 109 108] map[Content-Type:[application/x-www-form-urlencoded]] POST
2014-12-15T01-55-10Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml 0 0001-01-01 00:00:00 +0000 UTC 0 <nil>}: taskqueue: task has already been added
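(The byte slice above is the task's POST body; it decodes to feed=http%3A%2F%2Fwiki.xen.prgmr.com%2Fxenophilia%2Fatom.xml, i.e. the same feed URL.)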
But the update finally happened and there were no further errors:
18:15:43.177 update feed http://wiki.xen.prgmr.com/xenophilia/atom.xml
18:15:44.759 hasUpdate: true, isFeedUpdated: false, storyDate: 2014-12-15 01:21:38.37879 +0000 UTC, stories: 15
18:15:44.765 0 update stories
18:15:44.765 putting 1 entities
18:15:45.133 next update scheduled for 30m16s from now
At 18:47:47.421 the same feed was updated again, this time without any duplicate tasks:
18:47:47.305 queuing feed 2014-12-15T02-46-01Z07-00_http_3A_2F_2Fwiki_2Exen_2Eprgmr_2Ecom_2Fxenophilia_2Fatom_2Exml
I thought of another situation worth mentioning: what if we queue an update-feed task but fetchFeed fails? Maybe the server is temporarily down or unreachable, or it's returning error messages instead of good RSS or Atom. Will we be able to try again with a new task, or is this feed stuck forever?
When I wrote the task-name code I wasn't thinking about that. But looking at it now, I think it's OK because the existing feedError handler in UpdateFeed takes care of it. If anything goes wrong with the fetch or parse, NextUpdate becomes time.Now().Add(time.Hour * time.Duration(v)), where v is calculated from the number of errors previously observed. So when UpdateFeeds tries to queue a new task for this problem feed, it'll have an updated NextUpdate value, and so the task name won't collide with the old one.
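A sketch of that backoff; the Feed fields and the formula for v are assumptions (the real calculation lives in goread's feedError handler):

```go
import "time"

type Feed struct {
	Errors     int       // consecutive fetch/parse failures
	NextUpdate time.Time // next scheduled update
}

// feedError pushes NextUpdate further out on each consecutive failure,
// so the next queued task for this feed gets a fresh, non-colliding name.
func feedError(f *Feed) {
	f.Errors++
	v := f.Errors // assumed: one extra hour per consecutive error
	f.NextUpdate = time.Now().Add(time.Hour * time.Duration(v))
}
```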
Ok, I've had a chance to look over this fully now. I don't particularly care about human-readable task names, especially since the feed URL is already embedded in the task data. I'd prefer to use base64.URLEncoding for the names because I think it's safer and better tested.
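For illustration, a base64 variant of the naming; RawURLEncoding (Go 1.5+) avoids the '=' padding that task names don't allow, and with plain URLEncoding you'd have to strip it yourself:

```go
import "encoding/base64"

// base64Name encodes the feed ID with URL-safe, unpadded base64 so the
// task name stays within the allowed character set without a hand-rolled
// escaper. ts is the timestamp prefix from the earlier sketch.
func base64Name(ts, feedID string) string {
	return ts + "_" + base64.RawURLEncoding.EncodeToString([]byte(feedID))
}
```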
The main part of the solution here is to append the feed's next update time to the task name. This converts what used to be a keys-only query (a small datastore op, which is free) into a read operation (not free). I think we can keep the keys-only query and still achieve this. Obviously, that means we can no longer depend on the feed's next update time. I believe we should instead use UpdateMin from settings.go, like so: time.Now().Truncate(UpdateMin). Since we never want to update more than once per UpdateMin, this will work most of the time. It will still fail when that boundary is crossed, but it's much better than what we have now, with no additional cost. Although because of this boundary problem, I'm not convinced this is a good idea. Will ponder.
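A sketch of that variant, reusing the hypothetical taskName helper from above (UpdateMin is goread's existing minimum update interval in settings.go):

```go
func bucketedName(feedID string) string {
	// Bucket by wall-clock time instead of the stored NextUpdate, so
	// UpdateFeeds can keep its free keys-only query. Two runs inside the
	// same UpdateMin window collide; runs straddling a boundary slip through.
	bucket := time.Now().Truncate(UpdateMin)
	return taskName(bucket, feedID)
}
```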
Looking at the log messages I posted above, would using UpdateMin have helped? Also consider a scenario where many feeds are responding slowly and the queue is backed up.
For me the read operation is worth the cost: it's staying well under 20% of the free quota. It may be a question of which is more expensive: the datastore reads or the extra tasks. But it might also be worth considering something based on TCP/IP timeouts and the settings from cron.yaml and queue.yaml. Another option might be to keep task metadata in memcache, if that's cheaper than using the datastore.