mastodon icon indicating copy to clipboard operation
mastodon copied to clipboard

Flatten `Status.tagged_with_all` scope

Open mjankowski opened this issue 1 year ago • 3 comments

Previously we were looping through all the supplied tag ids and appending a new condition for each one so that the generated query was a collection of AND'd EXISTS statements. Query times under this approach are reasonable with small tag id counts, but grow as the tag id array size grows.

The update uses a group by / having approach instead. Query times remain fairly constant under this approach, with new IDs added to an IN condition.

As I was working on this, it seemed vaguely familiar and I realized there was a prior attempt at similar change:

  • Original attempt which fixed brakeman warning and converted the structure to a subquery approach -- https://github.com/mastodon/mastodon/pull/25882 -- we benchmark'd and discovered bad performance with the subquery
  • Actual change which fixed brakeman and moved from many INNER JOIN to many EXISTS -- https://github.com/mastodon/mastodon/pull/25941

I think the change here (by avoiding subquery) should be more performant than the last approach ... but I'd love feedback on both perf aspects, and whether there are any scenarios we need to expand spec coverage on. There is coverage for this around various scenarios of statuses with none/some/all of the passed in tags, but if we have edge cases I'd like to get them in specs before fully reviewing this.

I did some lightweight local benchmarking -- I created 1000 each Status and Tag, then I looped through all statuses, assigned 25 random tags to each.

First, I found the most used tags, and passed in an increasing number of tag IDs from that collection into the query:

Tag IDs Old New
1 1.0 1.2
2 1.1 1.4
3 1.7 1.5
4 2.6 1.7
5 6.7 1.6

Then I took a subset of tags actually used by a random status, and passed those in, also increasing number each time:

Tag IDs Old New
1 1.1 1.3
2 1.1 1.3
3 1.7 1.3
4 2.6 1.4
5 4.4 1.5

All times are in ms there.

Running explain on the queries makes sense of this as well ... in the current approach of appending more and more exists together, the explain tree keeps getting more and more nested the more IDs there are. The approach here keeps the explain structure flat, and just increases the IDs in the IN portion. So the performance is similar for small number of IDs, and starts to skew in favor of the approach here once its larger and larger (on comically larger values its very dramatic -- ie, 25 tag IDs is ~88ms on current approach, and ~2ms on this PR approach).

mjankowski avatar Mar 27 '24 15:03 mjankowski

If you have sample SQL queries (and Rails commands to find relevant objects to run them with), we can try running them on mastodon.social and see how is the perf

renchap avatar Mar 27 '24 16:03 renchap

If you have sample SQL queries (and Rails commands to find relevant objects to run them with), we can try running them on mastodon.social and see how is the perf

Good idea - will add something shortly. I just realized that the change here fails to handle the "nothing passed in" case. Going to add unit spec for that, and update. After that, will post some benchmark ideas.

mjankowski avatar Mar 27 '24 16:03 mjankowski

Updated with a commit to just avoiding calling the scope w/ nil values for now.

For benchmarking ... you could grab some collection of the most/least used tag ids as an array in irb, and then pass them through the original and changed method. For my local run I renamed the old one tagged_with_all_original and ran them side by side a bunch. You could call the new one something else, or just using the scope chain directly w/out defining a scope/method.

May be useful to put an explain after each to see the full plan cascade effect as well.

mjankowski avatar Mar 27 '24 17:03 mjankowski

Did we ever do any benchmarking on this, or do I need to provide a more detailed code snippet to use?

mjankowski avatar Oct 23 '24 13:10 mjankowski