
`DirectExchange` type always looks up the queue in the `delivery` method


I have profiled the apply_async method and figured out that a lot of time is being spent in the _lookup method of the exchange module when redis is the backend for celery (which calls the SMEMBERS function). I didn't find a clean way to add another exchange type, so in the end I monkey patched the DirectExchange type to only execute the _put method. I would appreciate it if someone could point out how I can define a custom exchange type.
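
Roughly what the monkey patch looks like, as a minimal sketch (not the exact patch I used; it assumes the routing key equals the queue name, which holds for celery's default direct-exchange setup):

```python
# Sketch only: skip the broker lookup for direct exchanges and push
# straight to the queue. Assumes routing_key == queue name, which holds
# for celery's default direct-exchange setup but not in general.
from kombu.transport.virtual import exchange


def _deliver_without_lookup(self, message, exchange_name, routing_key, **kwargs):
    # Bypass channel._lookup() (SMEMBERS on the redis transport) and
    # write the message directly to the queue named after the routing key.
    self.channel._put(routing_key, message, **kwargs)


exchange.DirectExchange.deliver = _deliver_without_lookup
```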

Check out the snakeviz profile graph (publish_task): you can see that 22.7s are spent in the _lookup method, even more time than in the actual _put method.

So is this _lookup really required, and could it be skipped without monkey patching?

stevanmilic avatar Jan 24 '19 14:01 stevanmilic

To be honest, I haven't looked into the implementation of the Redis broker at all.

The only place we call SMEMBERS is in get_table(): https://github.com/celery/kombu/blob/3a0454c21d5eee7b641af31ac6e046296aa38d57/kombu/transport/redis.py#L824-L830

get_table() is called in _lookup() as you said: https://github.com/celery/kombu/blob/3a0454c21d5eee7b641af31ac6e046296aa38d57/kombu/transport/virtual/base.py#L700-L726

Since we're talking about the direct exchange we can use SISMEMBER instead to check if a queue with the same name exists in the set.

We could introduce a _lookup_direct method to the virtual channel which will do so when needed.
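
A rough sketch of that idea (a hypothetical `_lookup_direct` on the redis channel; the binding member layout, routing key, pattern and queue joined by the channel's separator, follows how get_table() parses the set, so verify it before relying on it):

```python
# Hypothetical sketch, not the actual kombu API: an O(1) direct-exchange
# lookup for the redis virtual channel using SISMEMBER instead of SMEMBERS.
def _lookup_direct(self, exchange, routing_key):
    set_key = self.keyprefix_queue % exchange
    # The binding set appears to store 'routing_key<sep>pattern<sep>queue'
    # entries; for a direct exchange the queue is named after the routing
    # key and the pattern is empty.
    member = self.sep.join([routing_key, '', routing_key])
    with self.conn_or_acquire() as client:
        if client.sismember(set_key, member):
            return [routing_key]
        return []
```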

Care to provide the PR?

thedrow avatar Apr 03 '19 09:04 thedrow

@thedrow I've made a PR. Although the PR does bring some optimization, it didn't solve my problem. In my case the exchange has only one queue in its set, so the lookup with SMEMBERS comes down to O(1) anyway.

It's up to you to decide if we need this feature.

stevanmilic avatar May 07 '19 21:05 stevanmilic

Do you happen to have an analysis of how much this improved the performance of the Redis broker? Do you know what else could be affecting performance in this case?

thedrow avatar May 12 '19 08:05 thedrow

Do you happen to have an analysis of how much this improved the performance of the Redis broker?

Last time I measured it, it took the same amount of time.

Do you know what else could be affecting performance in this case?

I initially wrote that I wanted to skip the lookup method for redis. I got this idea from gocelery -- a Go implementation of this library. They don't use a lookup when sending a message to celery (code); instead they push messages directly to the queue. Of course, that means they don't support other types of virtual exchanges.

So the solution could be to not use the lookup for the direct exchange and just push directly to the queue, or to call the lookup at the start of the session or something. I didn't think this through, so it's probably flawed. What do you think about it?
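
For example, the "call the lookup at the start of the session" variant could be as simple as caching the result per (exchange, routing_key) pair (illustrative only; CachedLookup is not a kombu API):

```python
# Illustrative sketch of caching the lookup per session; not kombu API.
class CachedLookup:
    """Wraps channel._lookup so the SMEMBERS cost is paid once per routing key."""

    def __init__(self, channel):
        self.channel = channel
        self._cache = {}

    def __call__(self, exchange, routing_key):
        key = (exchange, routing_key)
        if key not in self._cache:
            # Only the first publish for this pair hits the broker.
            self._cache[key] = self.channel._lookup(exchange, routing_key)
        return self._cache[key]
```

The obvious downside is staleness: queues added or removed mid-session wouldn't be seen.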

stevanmilic avatar May 12 '19 19:05 stevanmilic

You have to check that the exchange still exists. If it's gone and there is an existing queue we have a bug.

I was surprised by the fact that #1041 did not bring a performance improvement at all. There's a difference in payload size between a list of keys in a set and a boolean, but it probably doesn't matter since the difference is too small. I also find it unreasonable that looking for one key in a set or fetching it from a set should take so long. Can you reproduce this locally without your usual load, or does this problem only reproduce under stress?

What happens when you skip the exchange?

thedrow avatar May 15 '19 15:05 thedrow

@thedrow From the trace we see that the issue is in the call to get_table -- my changes in #1089 change the _lookup method of the redis transport to not use get_table -- the prior implementation that called sismember still called get_table in the lookup_direct method. I preferred maintaining a consistent _lookup API over adding new methods -- the old method had the performance issue, so leaving it around and adding a new method was not a goal of mine. Please see #1089

matteius avatar Aug 29 '19 12:08 matteius

As far as I remember, there weren't any differences between sismember and smembers. The main reason is that the exchange had only one queue in its set, so both methods come down to O(1).

@matteius the issue isn't in get_table itself, it's that _lookup makes an additional call to redis, which I thought wasn't needed. But as @thedrow stated, you have to check if the exchange still exists.

@thedrow sorry for abandoning this issue, but I think we can close it, since IMO this is expected behavior.

stevanmilic avatar Aug 29 '19 13:08 stevanmilic

@stevanmilic I think there is a caveat to what you are saying. get_table calls smembers, which is O(n): it has to retrieve all the members of the set from redis, which could be many, and then parse them in Python. Instead, redis sismember just checks whether the one member is in that set and returns a boolean, which is O(1) -- my optimization in the other PR calls just sismember and does not call get_table -- so technically get_table is the performance issue because it is O(n), and now the redis _lookup implementation calls scard and sismember, which are both O(1).

See: https://redis.io/commands/smembers https://redis.io/commands/scard https://redis.io/commands/sismember
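
To make the contrast concrete with plain redis-py (the `_kombu.binding.*` key and the member layout here reflect my understanding of how the transport stores bindings; treat them as an example, not the exact wire format):

```python
# Rough comparison with plain redis-py; the binding set key and the member
# layout (routing_key, pattern, queue joined by '\x06\x16') mirror how the
# redis transport appears to store direct-exchange bindings.
import redis

r = redis.Redis()
binding_set = '_kombu.binding.celery'
member = '\x06\x16'.join(['celery', '', 'celery'])

all_bindings = r.smembers(binding_set)           # O(n): ships every binding to the client
has_binding = r.sismember(binding_set, member)   # O(1): membership check only
n_bindings = r.scard(binding_set)                # O(1): cardinality only
```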

That's why, if you can profile this again, I think it will perform better even for users that don't check whether the exchange still exists.

matteius avatar Aug 29 '19 13:08 matteius

Yes, I know that. But in my case there was always a single queue (element) in the exchange (set). I'm not sure in which cases kombu creates multiple queues for a single exchange when working with redis.

stevanmilic avatar Aug 29 '19 13:08 stevanmilic

@stevanmilic In celery you can bind multiple queues to a single exchange -- we do this all the time. The way the data is organized in Redis is to use a set for the exchanges, and then each exchange has a key set of all its queues -- the queues themselves are named redis lists under separate keys -- get_table works on the metadata of the whole set of queues on an exchange, which is a practical use case. However, I also don't understand the timing of your original code profile -- no way any of that code is taking 67 seconds to publish a message under any normal circumstances.
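
For example, binding several queues to one direct exchange is ordinary kombu usage (the exchange and queue names here are made up):

```python
# Several queues bound to one direct exchange -- plain kombu declarations.
# With these bindings, get_table('tasks') would describe three queues.
from kombu import Exchange, Queue

tasks = Exchange('tasks', type='direct')
queues = [
    Queue('high', exchange=tasks, routing_key='high'),
    Queue('default', exchange=tasks, routing_key='default'),
    Queue('low', exchange=tasks, routing_key='low'),
]
```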

matteius avatar Aug 29 '19 13:08 matteius

If I recall correctly, I ran a profiler while pushing ~100k elements on a VM, so that is accumulated time. My mistake for not stating that in the post.

When I ran tests, I did it with single exchange and one queue in it. That was also our production setup.

My point on smembers vs sismember: sure, O(1) is better than O(n), but n shouldn't be too large; correct me if I'm wrong.

stevanmilic avatar Aug 29 '19 13:08 stevanmilic

Well, n could be very large, which is what I thought you were reporting initially, because I did not realize it was a test of ~100K messages/elements. Anyway, if I do the math on that: 67.1 s / 100,000 messages * 1000 (ms/s) = 0.671 ms/message, which does not seem like a performance problem either way. However, I do think O(1) will be much better for users with many queues on a single exchange, which is what my changes do, so thanks for reporting this. This issue can probably be closed, but if you would run the same test again using my branch, I would be curious to know that it works and what the performance comparison is for ~100K messages.

matteius avatar Aug 29 '19 14:08 matteius