Update indexes to use tie-break ObjectId
Paginated requests are always sorted so that the pages are deterministic. To avoid nondeterministic "ties" on the sort keys, they use the ObjectId as a tie-breaker.
https://github.com/bluesky/databroker/blob/de3a2e9b2974353946af938ff6d6128623f1e625/databroker/mongo_normalized.py#L1561-L1568
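Roughly, the tie-break looks like this (a paraphrase of the linked code, not the exact source; query, skip, and limit are placeholders):

# The requested sort, e.g. newest first, with the unique _id appended as a tie-breaker.
sort_params = [("time", -1), ("_id", 1)]
cursor = collection.find(query).sort(sort_params).skip(skip).limit(limit)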
This means that we need to include {"_id": 1} in all indexes, including:
> db.run_start.createIndex({"time": -1, "_id": 1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 4,
"numIndexesAfter" : 5,
"ok" : 1
}
> db.run_start.createIndex({"time": 1, "_id": 1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 5,
"numIndexesAfter" : 6,
"ok" : 1
}
Are you sure that we need these indexes? Adding more indexes will increase insert time, because a sorted index has log(n) insert time. The {"time": -1, "_id": 1} index seems redundant because the time is already embedded in the _id, and _id is indexed by default.
The time field has finite precision and is not guaranteed to be unique, so if you ask for "give me documents [0-N]" and then "give me documents [N+1-2N]", and documents N and N+1 have the same time, there is a chance that you will get the same document as the last of the first page and the first of the second page, and never see the other one!
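For concreteness, here is a sketch of the two page queries, assuming a pymongo collection handle named collection and a page size N:

collection.find({}).sort([("time", -1)]).skip(0).limit(N)  # documents [0-N]
collection.find({}).sort([("time", -1)]).skip(N).limit(N)  # documents [N+1-2N]
# If documents N and N+1 share the same time, MongoDB guarantees nothing
# about their relative order across the two queries, so one of them can
# appear on both pages and the other on neither.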
Why don't we use _id for time searches?
collection.find({'_id': {'$lt': ObjectId.from_datetime(datetime_end), '$gt': ObjectId.from_datetime(datetime_start)}})
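A self-contained version of that query might look like this sketch (the connection details and date range are hypothetical):

from datetime import datetime, timedelta, timezone

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # hypothetical deployment
collection = client["databroker"]["run_start"]

# ObjectId.from_datetime builds a synthetic ObjectId whose embedded
# timestamp matches the given datetime, so the pair can bound a range
# scan on the default _id index.
datetime_end = datetime.now(timezone.utc)
datetime_start = datetime_end - timedelta(days=7)
cursor = collection.find({
    "_id": {
        "$gt": ObjectId.from_datetime(datetime_start),
        "$lt": ObjectId.from_datetime(datetime_end),
    }
})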
Interesting, if that is reliable that would simplify things!
This is interesting https://www.mongodb.com/docs/manual/core/index-creation/#index-builds-on-populated-collections
I have been using it, and it works well
The ObjectId is the document creation time, which may be close to start_doc.time but may not be. Consider the case where a MongoDB was populated by databroker-unpack from msgpack documents. Or the case where the RunEngine ---> Kafka ---> suitcase-mongo pipeline takes more than a little time to complete due to bursty document production or some transient failure.
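The two timestamps can be compared directly; a sketch, assuming a pymongo collection handle named collection that holds run_start documents:

from datetime import datetime, timezone

doc = collection.find_one()
inserted_at = doc["_id"].generation_time  # when the document reached Mongo (UTC)
started_at = datetime.fromtimestamp(doc["time"], tz=timezone.utc)  # start_doc.time
# Close to zero for live ingestion; arbitrarily large for unpacked or delayed data.
print(inserted_at - started_at)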
Let's consider the index builds thing as a separate issue.
In terms of priority, I think it's clear that fast (and correct...) search is more critical than RunStart document insert time.
Given @danielballan's point, the time in the start document is in general completely un-coupled from the insertion time. To go one further: with RE -> kafka -> [some non-mongo on-disk format] -> wait a month -> load into suitcase-mongo, searching based on insertion time will be completely off.
I think in some other use cases this trick may be useful (e.g. when the time of insertion into mongo is a meaningful time), but in our use case we need to actually search on start_doc.time.
I agree with @danielballan: the short-term thing that needs to be done is to ensure that we have indexes in place that give fast search on time, uid, and scan_id. Details about how to optimize the performance of the server should be handled elsewhere.
Okay, makes sense. So which indexes do we need to add {"_id": 1} to?
It looks like this one will do it.
https://github.com/bluesky/suitcase-mongo/blob/4895ad9784784bcbc346fbffdc7f1694f6739cc0/suitcase/mongo_normalized/__init__.py#L89-L91
In practice, we are always sorting by time and looking up by time range or by scan_id. I think that one index covers both.
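Concretely, the change might look something like this (a sketch against the linked suitcase-mongo code, not a tested patch):

# pymongo.DESCENDING and pymongo.ASCENDING are just -1 and 1.
self._run_start_collection.create_index(
    [("time", pymongo.DESCENDING), ("_id", pymongo.ASCENDING)]
)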
ohh, I was a little confused by the first post that said all indexes.
Fair, sorry for the false trail. I dashed this off in too much of a hurry, and I hadn't looked at the index creation in a while.
Do we need to add this one?
self._run_start_collection.create_index('scan_id', unique=False)
Yes, I think any place where there can be a tie, we need the tie-breaker.
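So the scan_id index would become something like (same caveats as the sketch above):

self._run_start_collection.create_index(
    [("scan_id", pymongo.DESCENDING), ("_id", pymongo.ASCENDING)],
    unique=False,
)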