datahike Durable persistent set

SUMMARY

Enable all stores to use with persistent set

Checks

Feature

[ ] Implements an existing feature request. Make sure the feature request has been accepted for implementation before opening a PR.
[ ] Related issues linked using fixes #number
[ ] Integration tests added
[ ] Documentation added
[ ] Architecture Decision Record added
[x] Formatting checked
[x] CHANGELOG.md updated

ADDITIONAL INFORMATION

Mar 28 '22 11:03 jsmassa

Before merging, following tasks should be done:

[ ] Remove set jar and import automatically from GitHub
[ ] Evaluate performance compared to old in-memory persistent set implementation

Apr 12 '22 14:04 jsmassa

Fixes #519

May 30 '22 06:05 jsmassa

Fixes #507 by enabling identical tests

May 30 '22 11:05 jsmassa

Update: Thread safety is now covered in an update to the persistent-sorted-set, providing (hopefully) safe bounded memory usage under concurrent reads (and cache evictions). I have stress tested the implementation with the integrated benchmark suite and small cache size, but there is still an issue with store management where I think the store is deleted including a needed value before the query tests are run. This potentially was not visible before as more data was hold in memory. This issue needs to be solved and it needs to be verified that performance satisfies the requirements in general. We also need to decide where to put the cache and make sure we only have one per database. Other than that I think we are mostly there.

Aug 23 '22 03:08 whilo

Ok, I have mostly completed the open issues I saw and see this PR as basically done. Note that cache sizes now need to be greater than 0 because the cache constructors do not support 0 sized caches (there was still a 1000 hardcoded, I think accidentally). I can do further performance tuning of the comparators and the set if needed, otherwise I would suggest to do this in another PR. I think we want to have a persistent-sorted-set release before merging this, I can coordinate with @tonsky to land this.

Aug 25 '22 04:08 whilo

I have worked on the upsert comparator to get the migration performance back to a sensible level. The jobtech benchmarks are also showing good results now, so from this point of view, this PR is finally ready now.

A release of the set before merging would be nice.

The use of the :cache-size parameter for konserve confused me. It should be an separate parameter, so the caches can be controlled independent of each other. I will change it accordingly.

Aug 31 '22 12:08 jsmassa

I agree, but I would also say that ideally we should only need the persistent-sorted-set cache now. We should deprecate and drop the hitchhiker-tree as backend once the set performs sufficiently well and then its caching mechanism is providing the same logic as the underlying konserve cache had, so that will be redundant and I think we can drop the konserve cache.

I think the same should hold for the cache we currently have for slicing (search), the persistent sorted set should be fast enough to drop it, unless our comparators slow down the slicing so much that we need to cache the results because of this slowness. In terms of pure iterator speed on top of integers the pss is the same for me as Clojure's persistent vector which is used for the slicing cache. The benefit of focusing all the caching budget we have now to the pss cache is that it is shared between all database instances even while transacting into the databases while also deduplicating any overlap that would exist between different slices of the indices cached. We have to measure the impact of this though.

Sep 01 '22 00:09 whilo

We can set the default parameters accordingly to promote the parameter combination we see as optimal. The search cache can be turned off completely by setting it to 0 anyway. I wouldn't drop anything (yet) though since at some point we also want to explore more (elaborate) data structures for the index and eventually have one that is more engineered to the problem and integrated into the project. Until then it's nice to have some options for experimentation.

Sep 01 '22 06:09 jsmassa

I started to investigate the current problems and found out that both the current version and the prior version with my changes are not working properly with the jobtech benchmarks. @kordano has encountered this problem as well and is working on finding the root cause.

Sep 01 '22 14:09 jsmassa

Blocked by #542

Nov 03 '22 13:11 jsmassa

I have run the jobtech benchmarks with 100 iterations and the results look fine. The median of the measurements is almost the same and the average is usually better. Here are some measurements (in ms):

base-urls:

main	dps
29.028	29.436
52.818	90.214
2460.009	2100.082

graphql-basic:

main	dps
7416.746	3491.175
27.092	35.937
961.287	455.830

graphql-benchmark:

main		dps
min	max	min	max
167.5	2091.6	168.1	1575.4
23.8	195350.6	24.1	91244.2
10.2	952.2	9.5	448.0
9.3	669.6	9.4	334.0
17.3	12481.2	18.8	5693.7
8.9	946.2	9.0	450.0

endpoints-benchmark:

main		dps
min	max	min	max
9.1	14.0	8.9	14.6

In other runs the base-url benchmark and the graphql-basic benchmarks were all better with this branch than with the main branch, so all-in-all I would suggest finally merging this into main.

What do you think, @kordano?

Nov 08 '22 19:11 jsmassa

Looks good, but fails on my machine with:
Syntax error compiling at (datahike/store.cljc:101:5).
No such var: fs/connect-fs-store

This indicates that you use an old konserve version somehow.

Nov 11 '22 07:11 whilo

nice one indeed!

Nov 17 '22 11:11 mpenet

@mpenet Nice to hear from you! https://github.com/replikativ/datahike/pull/232 is about to be merged next and should provide some of the functionality you were interested in (gc and versioning), let me know what you think and whether anything could be added to make it work better for you.

Nov 17 '22 21:11 whilo

datahike datahike copied to clipboard

Durable persistent set

SUMMARY

Checks

Feature

ADDITIONAL INFORMATION

datahike
datahike copied to clipboard