databroker icon indicating copy to clipboard operation
databroker copied to clipboard

Update indexes to use tie-break ObjectId

Open danielballan opened this issue 3 years ago • 17 comments

Paginated requests are always sorted so that the pages are deterministic. To avoid nondeterministic "ties" in the sorting parameters, they use ObjectId as a tie-breaker.

https://github.com/bluesky/databroker/blob/de3a2e9b2974353946af938ff6d6128623f1e625/databroker/mongo_normalized.py#L1561-L1568

This means that we need to include {"_id": 1} in all indexes, including:

> db.run_start.createIndex({"time": -1, "_id": 1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 4,
	"numIndexesAfter" : 5,
	"ok" : 1
}
> db.run_start.createIndex({"time": 1, "_id": 1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 5,
	"numIndexesAfter" : 6,
	"ok" : 1
}

danielballan avatar Nov 10 '21 22:11 danielballan

Are you sure that we need these indexes? Adding more indexes will increase insert time because a sorted index has log(n) insert time. The {"time": -1, "_id": 1} index seems redundant because the time is already embedded in the _id, and _id is indexed by default.

gwbischof avatar May 12 '22 16:05 gwbischof

The time field is finite precision and not guaranteed to be unique so if you ask for "give me documents [0-N]" and then "give me documents [N+1- 2N]" and the N and N+1 documents have the same time there is a chance that you will get the same document for the last of the first page and first of the second page and never see the other one!

tacaswell avatar May 12 '22 16:05 tacaswell

Why don't we use _id for time searches?

gwbischof avatar May 12 '22 16:05 gwbischof

collection.find({'_id': {'$lt': ObjectId.from_datetime(datetime_end), '$gt': ObjectId.from_datetime(datetime_start)}})

gwbischof avatar May 12 '22 16:05 gwbischof

Interesting, if that is reliable that would simplify things!

tacaswell avatar May 12 '22 16:05 tacaswell

This is interesting https://www.mongodb.com/docs/manual/core/index-creation/#index-builds-on-populated-collections

gwbischof avatar May 12 '22 16:05 gwbischof

I have been using it, and it works well

gwbischof avatar May 12 '22 16:05 gwbischof

The ObjectId is the document creation time, which may be close to start_doc.time but may not. Consider the case where a MongoDB was populated by databroker-unpack from msgpack documents. Or the case where the RunEngine ---> Kafka ---> suitcase-mongo pipeline takes more than a little time to complete due to burst-y document production or some transient failure.

Let's consider the index builds thing as a separate issue.

danielballan avatar May 12 '22 17:05 danielballan

In terms of priority, I think it's clear that fast (and correct...) search is more critical than RunStart document insert time.

danielballan avatar May 12 '22 17:05 danielballan

Given @danielballan 's point that the time in the start document is in general completely un-coupled from the insertion time. To go one further, RE -> kafka -> [some non-mongo on-disk format] -> wait a month -> load into suitcase-mongo searching based on time will be completely off.

I think in some other use cases this trick may be useful (e.g. when insertion into mongo time is a meaningful time), but in our use case we need to actually search on start_doc.time.

I agree with @danielballan the short term thing that needs to be done is to ensure that we have indexs in place that give fast search on time, uid, and scanid. Details about how to optimize the performance of the server should be handled else where.

tacaswell avatar May 12 '22 21:05 tacaswell

okay, makes sense. So which indexes do we need to add {"_id": 1} to?

gwbischof avatar May 12 '22 21:05 gwbischof

It looks like this one will do it.

https://github.com/bluesky/suitcase-mongo/blob/4895ad9784784bcbc346fbffdc7f1694f6739cc0/suitcase/mongo_normalized/init.py#L89-L91

danielballan avatar May 12 '22 21:05 danielballan

In practice, we are always sorting by time and looking up by time range or by scan_id. I think that one index covers both.

danielballan avatar May 12 '22 21:05 danielballan

ohh, I was a little confused by the first post that said all indexes.

gwbischof avatar May 12 '22 21:05 gwbischof

Fair, sorry for the false trail. I dashed this off in too much of a hurry, and I hadn't looked at the index creation in awhile.

danielballan avatar May 12 '22 21:05 danielballan

Do we need to add this one?

self._run_start_collection.create_index('scan_id', unique=False)

gwbischof avatar May 12 '22 21:05 gwbischof

Yes, I think anyplace where there can be a tie we need the tie breaker.

tacaswell avatar May 12 '22 21:05 tacaswell