mod_tile icon indicating copy to clipboard operation
mod_tile copied to clipboard

Update render queue limits

Open Komzpa opened this issue 8 years ago • 15 comments
trafficstars

Adjust render queue depth so that a rendering server dirty queue length is shifted from several minutes (1000 / 7 ~= 142 s ~= 2.3 minutes) to a whole day.

Metrics are taken from http://munin.osm.org/static/dynazoom.html?cgiurl_graph=/munin-cgi/munin-cgi-graph&plugin_name=openstreetmap/yevaud.openstreetmap/renderd_processed&size_x=800&size_y=400&start_epoch=1483213021&stop_epoch=1483904221

Komzpa avatar Jan 08 '17 19:01 Komzpa

The values are tailored based on osm.org statistics.

On non-osm.org installs:

  • if it's a local small install, the queue just won't grow. No major change.
  • if it's a popular but not overloaded install, the queue will grow in the same fashion as on osm.org, and will be shrinking on non-popular hours. Starting from day 2 the number of fresh tiles served from cache directly grows, outweighing the queuing overhead by a lot.
  • if it's an overloaded install, the queue would be large and won't shrink. This adds some time for queuing, but this shouldn't be noticeable as the install is overloaded and is unable to serve tiles on time anyway.

Komzpa avatar Jan 08 '17 22:01 Komzpa

hello, what do we do to get it merged?

Komzpa avatar Nov 15 '17 11:11 Komzpa

Not having our servers running in full meltdown mode 24x7 could of course be considered a feature rather than a bug...

tomhughes avatar Nov 15 '17 16:11 tomhughes

Incidentally I believe the current queue lengths are the results of previous experimentation by @apmon although I can't claim to remember all the details.

I have some sort of memory that there were performance issues managing the queue if it was too big?

tomhughes avatar Nov 15 '17 16:11 tomhughes

There's a lot of time between 2 minutes and the whole day. Is it possible to test a single server with something like an hour or 10 minutes for example?

kocio-pl avatar Nov 21 '17 15:11 kocio-pl

@kocio-pl I can provide such a branch. Does anybody have a machine to run it on, with a proper load? @tomhughes can we do a split-test on tile.osm.org? :)

Komzpa avatar Nov 26 '17 20:11 Komzpa

Not easily. At the very least you would need to provide Ubuntu compatible binary packages.

tomhughes avatar Nov 26 '17 20:11 tomhughes

I guess you could ask @rrzefox if he's willing to test it.

kocio-pl avatar Nov 26 '17 20:11 kocio-pl

Maybe it makes sense to start with making these variables configurable through mod_tile.conf. Once the code and package are updated, it would be much easier to experiment and even tweak it the way any given server admins like it.

kocio-pl avatar Nov 26 '17 21:11 kocio-pl

I guess you could ask @rrzefox if he's willing to test it.

Sorry, no can do - I'm using tirex, not renderd. Which incidentally does not have a queue limit, at least none I've ever hit...

Maybe it makes sense to start with making these variables configurable through mod_tile.conf.

You probably mean renderd.conf? mod_tile.conf would be something read by apache, and apache is not handling the queue, the render-daemon is. But indeed, something like queue length is a value I'd expect to be configurable, not hardcoded.

rrzefox avatar Nov 26 '17 21:11 rrzefox

Making it configurable would certainly be a good idea - then we could do a new build and test different configurations.

tomhughes avatar Nov 26 '17 21:11 tomhughes

I'm a bit confused by this proposed change. There are basically 5 different queue classes, which are being handled by request_queue_fetch_request in the following order:

  1. Priority Request Queue,
  2. Request Queue,
  3. Low Priority Request Queue,
  4. Dirty Queue, and
  5. Bulk Request queue.

As long as there are queue entries in a higher class, they will be processed first. (I wonder why this doesn't lead to starvation issues, but that's another issue).

To my understanding, changing the DIRTY_LIMIT will only affect entries in category (4), that is the Dirty Queue. However, this queue seems pretty empty most of the time anyway: munin link

Isn't this queue only relevant, if someone adds /dirty to an URL to manually trigger a re-rendering of a tile?

mmd-osm avatar Dec 03 '17 14:12 mmd-osm

hey @mmd-osm, Dirty Queue is not rendered during high load times, and not kept until low load times. https://munin.openstreetmap.org/openstreetmap/render.openstreetmap/renderd_queue.html

Dirty queue holds all the tiles that were rendered previously and marked dirty due to someone making an edit in the area. They can be rendered on change instead of being rendered later on the first request.

Komzpa avatar Dec 03 '17 15:12 Komzpa

Thanks, your munin link makes a lot more sense than mine. So yes, Dirty Queue as prio 4 both has to discard entries due to its currently limited size, and there's also a starvation issue here, because most everything else is more important than processing those dirty tiles.

Bounded overtaking would be one potential approach to limit those starvation effects, but that's currently not implemented. It may be even better in the long run than just increasing the dirty queue size.

Also, I really thought, only /dirty would add new entries to the Dirty Queue, thanks for clarifying that bit.

mmd-osm avatar Dec 03 '17 15:12 mmd-osm

Thought I'd throw in some clarifying remarks.

Assuming this information is still accurate, I haven't looked at the code in quite a while so I don't know if things have changed in this respect.

Priority Request Queue, Request Queue, Low Priority Request Queue are all meant for on the fly rendering. i.e. if a user requests a tile in mod_tile and it is out of date in some way or another, mod_tile delays serving that tile while it tries to render it. So these renderings are "time critical" as they delay the serving of a tile. Priority Request Queue thereby handles tiles that have no stale cached tile. I.e. if the on the fly rendering fails, mod_tile will deliver a 404 tile not found response. That is why this is the highest priority. Request Queue servers tiles that have a cached stale tile for which we know the data has changed, so if mod_tile ends up serving the cached version as renderd doesn't render the tile in time, the user will see "wrong"/"old" geometry/features. Low Priority Queue is for when the style sheet has changed, but not the data it self.

As all of these queues are designed for on the fly rendering, their size is limited to 32 entries. It is not reasonable to believe renderd will render more than 32 metatiles in the time before mod_tile times out and serves either a stale or 404 tile. So there is no point in having a longer queue there, as that would just add to the starvation issue.

If a tile tries to be added to lets say the request queue, but the queue is full, it spills over into the dirty queue, which is the "large batch background" rendering queue which isn't "time critical" anymore, as there is no user directly waiting for the answer. This queue is there to try and steady the load on the server and make sure that the server is still doing something useful during low load times.

So increasing the queue length beyond 2 minutes definitely seems good. However, there are definitely a couple of issue with increasing the dirty queue size. As as been pointed out, the dirty queue can easily be starved out. Now if during a burst of high load the priority queue with missing tiles overflow, some of those missing tiles request end up in the dirty queue. If the load then drops a little so that the priority queue is no longer full, but e.g. the low priority queue is still saturated, then renderd will only render those style sheet changed tiles, but never get around to rendering the missing tile that has dropped into the dirty queue. If a new user now tries to look at that dirty tile, mod_tile will submit it to renderd again, but as it is already queued (in the dirty queue) renderd won't requeue it into the priority queue, even if the priority queue could now hold the request. So it is possible to starve out missing metatiles in the dirty queue, which is very bad. With a shorter dirty queue this risk of starving important tiles out is somewhat reduced.

Furthermore, back then (not sure if that has changed by now), renderd didn't really use any library routines. So instead of using a decent quality existing implementation of a priority queue, I coded up my own implementation. And that implementation was in the category "good enough at the time", not really in the "good" category.

So ideally that hole queuing implementation gets replaced by a proper library implementation of a priory queue with re-queueing of higher priority requests and with something like an epsilon greedy dequeueing strategy to reduce starvation issues.

apmon avatar Dec 03 '17 18:12 apmon