
[Home] Optimise random works queries

Open alloy opened this issue 7 years ago • 6 comments

MongoDB does not have built-in ‘random’ functionality. Some possibilities:

  1. Add a ‘random’ attribute to the collection: https://github.com/mongodb/cookbook/blob/master/content/patterns/random-attribute.txt
  2. Skip: https://alan-mushi.github.io/2015/01/18/mongodb-get-random-document-benchmark.html#the-skip-trick-method-4a--4b
  3. Alternatively we could make something ourselves that is based on the way the MP code does it now.
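     work_set_size = size * 3
     scope = Model.where(…)
     # Clamp the offset: with fewer than work_set_size matches, the subtraction
     # goes to zero or negative and rand no longer returns a valid skip value.
     max_offset = [scope.count - work_set_size, 0].max
     work_set = scope.skip(rand(max_offset + 1)).limit(work_set_size)
     work_set.to_a.shuffle.first(size)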
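
To try the idea behind option 3 without a database, here is a minimal in-memory sketch. The method name `sample_window` and the plain array standing in for the query scope are assumptions for illustration only, not how MP or Gravity implement it.

```ruby
# In-memory stand-in for option 3: take a random contiguous window of
# size * 3 items, shuffle it, and keep the first `size`.
# `items` plays the role of the query scope (hypothetical, for illustration).
def sample_window(items, size, rng: Random.new)
  work_set_size = size * 3
  # Largest valid starting offset; 0 when there are fewer items than the window.
  max_offset = [items.length - work_set_size, 0].max
  offset = rng.rand(max_offset + 1)
  window = items[offset, work_set_size] || []
  window.shuffle(random: rng).first(size)
end

works = (1..100).to_a
picked = sample_window(works, 5, rng: Random.new(42))
```

The picks all come from one contiguous window, so this rotates content rather than sampling uniformly from the whole collection.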

@cavvia @joeyAghion @mzikherman Do you have any thoughts on this?

alloy avatar Oct 20 '16 09:10 alloy

I'm just wondering what use case requires the selection of random works? Are you interested in rotating content? It doesn't sound like an optimal ranking strategy for most contexts I can think of.

Sorting by descending merchandisability, iconicity, or creation date are some of our other options. The v1/filter/artworks endpoint also supports a -decayed_merch sort which combines artwork freshness and merchandisability score.

cavvia avatar Oct 20 '16 14:10 cavvia

Aye, yeah, it’s about rotation, but specifically on each reload, as discussed during the home personalisation talks: https://docs.google.com/document/d/1QJ5NNK_LqVwomlqIg3MOojtRxgTxg9ky4q6ynfPrrUg/edit#heading=h.xn6uqaz56o0r

alloy avatar Oct 20 '16 15:10 alloy

I agree that we probably don't want truly random results, but we might want a variety from the "best" likely candidates (e.g., using some of the sorts @cavvia mentions). If we can accomplish this with existing page/size parameters to skip a small number of results, I wouldn't be too worried about performance. If we really want to skip to a random result in the collection (i.e., skipping a very large number), I would.

joeyAghion avatar Oct 20 '16 16:10 joeyAghion

@joeyAghion I’m unsure which option you’re referring to as the preferred one. Is it option 3, where we implement on Gravity a version similar to what MP does, or are you suggesting we fetch 3 times as much data from Gravity as the client actually requests?

alloy avatar Oct 21 '16 10:10 alloy

I was just suggesting that we sort by something meaningful and then index randomly into the early pages (e.g., fairs?size=5&page=[1-10], similar to (2) but using existing parameters). But now that I look at the examples you link to, I realize there may not be enough data to page significantly. Maybe (3) [or basically what MP does now] is best in the short term. Returning 60 results for each row that only reveals 5-6 is a lot though! Could we decrease that to ~20?
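
A rough sketch of the "sort, then index randomly into the early pages" idea. The helper name `random_early_page_query` is hypothetical, and whether a given endpoint accepts the `-decayed_merch` sort mentioned earlier in the thread is an assumption.

```ruby
require "uri"

# Hypothetical helper: build query parameters for a random page among the
# first `max_page` pages of an already-sorted listing.
def random_early_page_query(size:, max_page:, sort:, rng: Random.new)
  page = rng.rand(1..max_page) # inclusive range, so page is 1..max_page
  URI.encode_www_form(sort: sort, size: size, page: page)
end

query = random_early_page_query(size: 5, max_page: 10, sort: "-decayed_merch", rng: Random.new(1))
```

Because the skip never exceeds `size * (max_page - 1)` documents, the offset stays small, which is the property that keeps the performance concern low.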

joeyAghion avatar Oct 21 '16 14:10 joeyAghion

Yea, I think the sentiments from @cavvia and @joeyAghion sound like the right track.

We've done the random attribute thing, as I think that may be the only way to really get a random shuffling from a biggish collection. The potentially large arbitrary skip/offset can get super slow and you wind up not really using indexes properly.
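
For reference, a minimal in-memory sketch of the random attribute pattern. `Doc`, `build_collection`, and `pick_random` are hypothetical names, and the sorted array only stands in for an index on the attribute. It is not the code used in any of our apps.

```ruby
# Every document stores a random float at write time. A read draws a fresh
# float and takes the first document at or above it, wrapping around to the
# start when nothing is found.
Doc = Struct.new(:id, :random)

def build_collection(ids, rng: Random.new)
  # Sorting by :random mimics an index on that attribute.
  ids.map { |id| Doc.new(id, rng.rand) }.sort_by(&:random)
end

def pick_random(sorted_docs, rng: Random.new)
  return nil if sorted_docs.empty?
  r = rng.rand
  # Binary search mirrors an indexed `random >= r` lookup limited to one result.
  sorted_docs.bsearch { |doc| doc.random >= r } || sorted_docs.first
end

docs = build_collection((1..50).to_a, rng: Random.new(7))
picked_doc = pick_random(docs, rng: Random.new(3))
```

The lookup seeks directly to a position instead of skipping over documents, which is why it avoids the large-offset problem. The selection is not perfectly uniform, since a document that follows a large gap in the stored values is chosen more often.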

I think (3) sounds like the most reasonable option, pretty much like what @joeyAghion was suggesting.

mzikherman avatar Oct 21 '16 14:10 mzikherman