Update indexes to use tie-break ObjectId
Paginated requests are always sorted so that the pages are deterministic. To avoid nondeterministic "ties" on the sort keys, they use the ObjectId as a tie-breaker.
https://github.com/bluesky/databroker/blob/de3a2e9b2974353946af938ff6d6128623f1e625/databroker/mongo_normalized.py#L1561-L1568
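Roughly, the tie-break looks like this (a paraphrase of the linked code, not the exact source; query, skip, and limit are placeholders):

# The requested sort, e.g. newest first, with the unique _id appended as a tie-breaker.
sort_params = [("time", -1), ("_id", 1)]
cursor = collection.find(query).sort(sort_params).skip(skip).limit(limit)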
This means that we need to include {"_id": 1} in all indexes, including:
> db.run_start.createIndex({"time": -1, "_id": 1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 4,
"numIndexesAfter" : 5,
"ok" : 1
}
> db.run_start.createIndex({"time": 1, "_id": 1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 5,
"numIndexesAfter" : 6,
"ok" : 1
}
Are you sure that we need these indexes? Adding more indexes will increase insert time, because a sorted index has log(n) insert time. The {"time": -1, "_id": 1} index seems redundant because the time is already embedded in the _id, and _id is indexed by default.
The time field has finite precision and is not guaranteed to be unique, so if you ask for "give me documents [0-N]" and then "give me documents [N+1-2N]", and documents N and N+1 have the same time, there is a chance that you will get the same document as the last of the first page and the first of the second page, and never see the other one!
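For concreteness, here is a sketch of the two page queries, assuming a pymongo collection handle named collection and a page size N:

collection.find({}).sort([("time", -1)]).skip(0).limit(N)  # documents [0-N]
collection.find({}).sort([("time", -1)]).skip(N).limit(N)  # documents [N+1-2N]
# If documents N and N+1 share the same time, MongoDB guarantees nothing
# about their relative order across the two queries, so one of them can
# appear on both pages and the other on neither.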
Why don't we use _id for time searches?
collection.find({'_id': {'$lt': ObjectId.from_datetime(datetime_end), '$gt': ObjectId.from_datetime(datetime_start)}})
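A self-contained version of that query might look like this sketch (the connection details and date range are hypothetical):

from datetime import datetime, timedelta, timezone

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # hypothetical deployment
collection = client["databroker"]["run_start"]

# ObjectId.from_datetime builds a synthetic ObjectId whose embedded
# timestamp matches the given datetime, so the pair can bound a range
# scan on the default _id index.
datetime_end = datetime.now(timezone.utc)
datetime_start = datetime_end - timedelta(days=7)
cursor = collection.find({
    "_id": {
        "$gt": ObjectId.from_datetime(datetime_start),
        "$lt": ObjectId.from_datetime(datetime_end),
    }
})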
Interesting, if that is reliable that would simplify things!
This is interesting https://www.mongodb.com/docs/manual/core/index-creation/#index-builds-on-populated-collections
I have been using it, and it works well
The ObjectId is the document creation time, which may be close to start_doc.time but may not be. Consider the case where a MongoDB was populated by databroker-unpack from msgpack documents. Or the case where the RunEngine ---> Kafka ---> suitcase-mongo pipeline takes more than a little time to complete due to bursty document production or some transient failure.
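The two timestamps can be compared directly; a sketch, assuming a pymongo collection handle named collection that holds run_start documents:

from datetime import datetime, timezone

doc = collection.find_one()
inserted_at = doc["_id"].generation_time  # when the document reached Mongo (UTC)
started_at = datetime.fromtimestamp(doc["time"], tz=timezone.utc)  # start_doc.time
# Close to zero for live ingestion; arbitrarily large for unpacked or delayed data.
print(inserted_at - started_at)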
Let's consider the index builds thing as a separate issue.
In terms of priority, I think it's clear that fast (and correct...) search is more critical than RunStart document insert time.
Given @danielballan's point, the time in the start document is in general completely un-coupled from the insertion time. To go one further: with RE -> kafka -> [some non-mongo on-disk format] -> wait a month -> load into suitcase-mongo, searching based on insertion time will be completely off.
I think in some other use cases this trick may be useful (e.g. when the time of insertion into mongo is a meaningful time), but in our use case we need to actually search on start_doc.time.
I agree with @danielballan: the short-term thing that needs to be done is to ensure that we have indexes in place that give fast search on time, uid, and scan_id. Details about how to optimize the performance of the server should be handled elsewhere.
Okay, makes sense. So which indexes do we need to add {"_id": 1} to?
It looks like this one will do it.
https://github.com/bluesky/suitcase-mongo/blob/4895ad9784784bcbc346fbffdc7f1694f6739cc0/suitcase/mongo_normalized/__init__.py#L89-L91
In practice, we are always sorting by time and looking up by time range or by scan_id. I think that one index covers both.
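Concretely, the change might look something like this (a sketch against the linked suitcase-mongo code, not a tested patch):

# pymongo.DESCENDING and pymongo.ASCENDING are just -1 and 1.
self._run_start_collection.create_index(
    [("time", pymongo.DESCENDING), ("_id", pymongo.ASCENDING)]
)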
ohh, I was a little confused by the first post that said all indexes.
Fair, sorry for the false trail. I dashed this off in too much of a hurry, and I hadn't looked at the index creation in a while.
Do we need to add this one?
self._run_start_collection.create_index('scan_id', unique=False)
Yes, I think any place where there can be a tie, we need the tie-breaker.
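So the scan_id index would become something like (same caveats as the sketch above):

self._run_start_collection.create_index(
    [("scan_id", pymongo.DESCENDING), ("_id", pymongo.ASCENDING)],
    unique=False,
)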