Refactor query and search for better performance

Open jonasseglare opened this issue 9 months ago • 3 comments

This PR mainly refactors the namespaces datahike.query and datahike.db.search to improve the performance of the query engine. It removes the previous "relprod strategies"; the algorithm in this PR most closely resembles the "select-all" strategy that we had before.

I believe many things can contribute to the speed improvement. In particular, this refactoring moves work out of the innermost loop of the query engine, and that loop now consists of a transduction inside search-batch-fn. Using transducers probably avoids creating lots of short-lived intermediate values and lazy sequences, and that seems to contribute to the performance.

In some cases I use macros to move work out of loops, which comes at a cost in readability. I explored simpler ways of writing the code, but in the places where I do use macros to generate code and avoid work at runtime, they seem justified; see for example the comments in the function single-substitution-xform.
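
To illustrate the general point about transducers versus lazy sequences, here is a minimal sketch in Clojure. It is not code from this PR: the sample data and the names datoms, adult-ids-lazy, adult-ids-xform and adult-ids-xf are made up for illustration and are not part of Datahike.

```clojure
;; Hypothetical sample data: [entity attribute value] tuples.
(def datoms
  [[1 :name "Ada"] [1 :age 36]
   [2 :name "Bob"] [2 :age 17]
   [3 :age 52]])

;; Lazy-sequence version: every filter/map step produces its own
;; intermediate lazy sequence before the result is poured into a vector.
(defn adult-ids-lazy [datoms]
  (->> datoms
       (filter (fn [[_ a _]] (= a :age)))
       (filter (fn [[_ _ v]] (>= v 18)))
       (map first)
       (into [])))

;; Transducer version: the same steps composed into a single
;; transformation. No intermediate sequences are built; each input
;; tuple flows through all steps inside one reduction.
(def adult-ids-xform
  (comp (filter (fn [[_ a _]] (= a :age)))
        (filter (fn [[_ _ v]] (>= v 18)))
        (map first)))

(defn adult-ids-xf [datoms]
  (into [] adult-ids-xform datoms))
```

Both functions return the same vector of entity ids. The transducer version does the whole pipeline in one pass over the input, which is the kind of saving I suspect matters in a hot inner loop, although I have not measured this toy example.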

Here are the results. The results for this PR are labeled "Datahike new query/search" in the table below:

TARGET                      ABS TIME (s)   REL TIME
Some other db                      4.471         6%
Datahike 0.6.1558                 74.294       100%
Datahike 0.6.1559                 51.827        70%
Datahike new query/search         26.377        36%

The code to run the benchmark is found at https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/experimental/datahike-benchmark/.

I believe @whilo will want to review this code; he knows what it is about.

jonasseglare avatar May 09 '24 07:05 jonasseglare

Can you rebase this? I hope it is not too annoying.

whilo avatar Jun 17 '24 00:06 whilo

Thanks @whilo for the review so far! I have rebased and also improved the implementation of distinct-tuples as you requested.

jonasseglare avatar Jun 20 '24 12:06 jonasseglare

@jonasseglare could you do a rebase again?

whilo avatar Jul 31 '24 04:07 whilo

Before moving forward with this PR, I suggest we first review the code in https://github.com/replikativ/datahike/pull/691.

jonasseglare avatar Aug 28 '24 17:08 jonasseglare

I rebased on the latest main and force-pushed this branch.

jonasseglare avatar Aug 28 '24 22:08 jonasseglare

I pushed some commits that address the last comment about :type :lookup.

jonasseglare avatar Sep 10 '24 12:09 jonasseglare

I also ran the measurements again, and they suggest that these changes will make Datahike about twice as fast:

TARGET                      ABS TIME (s)   REL TIME
Some other db                      2.319        10%
Datahike 0.6.1573                 23.727       100%
refactor-query-and-search         12.278        52%

jonasseglare avatar Sep 10 '24 13:09 jonasseglare