datahike
Refactor query and search for better performance
This code mainly refactors the namespaces `datahike.query` and `datahike.db.search` to improve the performance of the query engine. This means that we remove the previous "relprod strategies"; the algorithm in this PR most closely resembles the "select-all" strategy that we had before. I believe many things can contribute to the speed improvement, but the main effect of this refactoring is that work is moved out of the innermost loop of the query engine, and this loop now consists of a transduction inside `search-batch-fn`. Using transducers probably avoids the creation of many short-lived intermediate values and lazy sequences, and seems to contribute to the performance. In some cases I use macros to move work out of loops, and that comes at a cost in readability. I explored simpler ways of writing the code, but in the places where I do use macros to generate code and avoid work at runtime, they seem justified; see for example the comments in the function `single-substitution-xform`.
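To illustrate the general point about transducers (this is not the code in this PR), here is a minimal sketch that filters and reshapes a small collection in two ways. The `datoms` vector and the functions `ages-over-lazy`, `ages-over-xform` and `ages-over` are hypothetical names made up for the example; the plain `[e a v]` vectors only stand in for Datahike's real datom type.

```clojure
;; Hypothetical data: plain [e a v] vectors, not Datahike's datom type.
(def datoms
  [[1 :name "Ada"] [1 :age 36] [2 :name "Bob"] [2 :age 17] [3 :age 52]])

;; Lazy-sequence pipeline: every step builds its own intermediate lazy seq
;; before the final vector is created.
(defn ages-over-lazy [min-age datoms]
  (->> datoms
       (filter (fn [[_ a _]] (= a :age)))
       (map (fn [[e _ v]] [e v]))
       (filter (fn [[_ v]] (> v min-age)))
       (into [])))

;; Transducer pipeline: the steps are composed into one reducing
;; transformation, so each element flows through all steps in a single pass
;; and no intermediate sequences are allocated.
(defn ages-over-xform [min-age]
  (comp (filter (fn [[_ a _]] (= a :age)))
        (map (fn [[e _ v]] [e v]))
        (filter (fn [[_ v]] (> v min-age)))))

(defn ages-over [min-age datoms]
  (into [] (ages-over-xform min-age) datoms))

(ages-over 18 datoms)
;; => [[1 36] [3 52]]
```

Both versions return the same result. The difference is only in how much intermediate structure is allocated along the way, which matters most when the pipeline runs inside a hot loop.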
Here are the results. The results for this PR are labeled `Datahike new query/search` in the table below:
| TARGET | ABS TIME (s) | REL TIME |
|---|---|---|
| Some other db | 4.471 | 6% |
| Datahike 0.6.1558 | 74.294 | 100% |
| Datahike 0.6.1559 | 51.827 | 70% |
| Datahike new query/search | 26.377 | 36% |
The code to run the benchmark is found at https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/experimental/datahike-benchmark/.
I believe @whilo will want to review this code; he knows what it is about.
Can you rebase this? I hope it is not too annoying.
Thanks @whilo for the review so far! I have rebased and also improved the implementation of `distinct-tuples` as you requested.
@jonasseglare could you do a rebase again?
Before moving forward with this PR, I suggest we first review the code in https://github.com/replikativ/datahike/pull/691 .
I rebased on the latest main and force-pushed this branch.
I pushed some commits that address the last comment about `:type :lookup`.
And I did some measurements again, suggesting that these changes will make Datahike about twice as fast:
| TARGET | ABS TIME (s) | REL TIME |
|---|---|---|
| Some other db | 2.319 | 10% |
| Datahike 0.6.1573 | 23.727 | 100% |
| refactor-query-and-search | 12.278 | 52% |