rdf4j
rdf4j copied to clipboard
GH-3843 collection factory
GitHub issue resolved: #3843
PR Author Checklist (see the contributor guidelines for more details):
- [x] my pull request is self-contained
- [ ] I've added tests for the changes I made
- [x] I've applied code formatting (you can use
mvn process-resources
to format from the command line) - [x] I've squashed my commits where necessary
- [x] every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change
@kenwenzel would you mind having a look at this? See if you like the ideas.
@kenwenzel would you mind having a look at this? See if you like the ideas.
@JervenBolleman I like the idea of making the collections configurable for different SAIL implementations. But currently, I can't oversee what collection types are required. In addition to changing the entire collection implementations, we could use a mapper interface to convert between SAIL-specific values and IDs.
Looks like github action for this pull request is hanging. I think I will close it in a bit and reopen it.
I am now at the point of actually wiring this up into the evaluation strategy. I think I don't have a good API placement or at least I am not to happy with the exact placing of methods. Also trying to make this work for the GroupIterator is not going well, and there it would have the highest impact. I need to change this more significantly or the alternative is that Serializable infects a whole lot of code.
lmdb query benchmark at 31c0933dfa989b2263c6a71ee107cf99979922f7
Benchmark Mode Cnt Score Error Units
QueryBenchmark.complexQuery avgt 5 71.214 ± 19.824 ms/op
QueryBenchmark.distinctPredicatesQuery avgt 5 230.005 ± 43.627 ms/op
QueryBenchmark.groupByQuery avgt 5 62.456 ± 5.115 ms/op
QueryBenchmark.pathExpressionQuery1 avgt 5 105.636 ± 3.124 ms/op
QueryBenchmark.removeByQuery avgt 5 293.808 ± 3.875 ms/op
QueryBenchmark.removeByQueryReadCommitted avgt 5 611.711 ± 26.506 ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone avgt 5 534.920 ± 10.934 ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted avgt 5 941.596 ± 48.650 ms/op
QueryBenchmarkFoaf.groupByCount avgt 5 2516.238 ± 97.970 ms/op
QueryBenchmarkFoaf.groupByCountSorted avgt 5 2417.892 ± 47.437 ms/op
QueryBenchmarkFoaf.personsAndFriends avgt 5 303.885 ± 3.198 ms/op
compared to develop at eab706d0d46cb340cb1f63c3098e42c802620a69
Benchmark Mode Cnt Score Error Units
QueryBenchmark.complexQuery avgt 5 19.973 ± 0.540 ms/op
QueryBenchmark.distinctPredicatesQuery avgt 5 178.691 ± 5.984 ms/op
QueryBenchmark.groupByQuery avgt 5 3.988 ± 0.248 ms/op
QueryBenchmark.pathExpressionQuery1 avgt 5 59.144 ± 0.901 ms/op
QueryBenchmark.removeByQuery avgt 5 284.295 ± 2.369 ms/op
QueryBenchmark.removeByQueryReadCommitted avgt 5 593.844 ± 23.566 ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone avgt 5 418.575 ± 4.357 ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted avgt 5 826.619 ± 44.657 ms/op
QueryBenchmarkFoaf.groupByCount avgt 5 3606.748 ± 49.783 ms/op
QueryBenchmarkFoaf.groupByCountSorted avgt 5 3471.768 ± 32.543 ms/op
QueryBenchmarkFoaf.personsAndFriends avgt 5 246.236 ± 4.751 ms/op
Big regressions on some, with big gains on others. I suspect the loss is due to always creating a mapdb database per query execution. This should be at least lazy, also more mapdb instances might be running.
This depends in part on the quality of the hashcode for the lmdb values. The first version was not great, the current one is better but still not great.
To make the equals test use the functions they need to be serializable because they need to be stored in mapdb. This is not nice, and I think that part of the code should be reverted.
@JervenBolleman Thank you for investigating this. Perhaps real improvements will be visible if the data sets get larger and don't fit into the available RAM. Most of our benchmarks more or less just test the in-memory case.
@kenwenzel I think it shows there is a big potential gain. On the other hand this code is not nearly good enough to be merged. On my home machine I have a benchmark run with great results, but that is with the enabling of the lazy mapdb mode. On the other hand the sync setting is not passed on so we always use an in memory hashmap. Quite a lot to still to be fixed. Especially around the setup of these collection factories.
at 847dbe415a8b4328f1df77e51389dd7b5e98bb84
QueryBenchmark.complexQuery avgt 5 17.374 ± 0.219 ms/op
QueryBenchmark.distinctPredicatesQuery avgt 5 167.387 ± 3.132 ms/op
QueryBenchmark.groupByQuery avgt 5 3.639 ± 0.135 ms/op
QueryBenchmark.pathExpressionQuery1 avgt 5 59.854 ± 1.249 ms/op
QueryBenchmark.removeByQuery avgt 5 284.124 ± 3.513 ms/op
QueryBenchmark.removeByQueryReadCommitted avgt 5 616.349 ± 8.395 ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone avgt 5 418.440 ± 5.842 ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted avgt 5 838.561 ± 23.125 ms/op
QueryBenchmarkFoaf.groupByCount avgt 5 2571.679 ± 58.877 ms/op
QueryBenchmarkFoaf.groupByCountSorted avgt 5 2347.194 ± 40.125 ms/op
QueryBenchmarkFoaf.personsAndFriends avgt 5 249.253 ± 1.353 ms/op
I need to verify here that the mapdb worked as it was supposed too.
Cool! That looks really promising. Have you also thought about caching of intermediate results (with MapDb)? This seems to be one of the reasons why QLever answers some queries fast.
Ok @kenwenzel this is a real improvement. Here the config on both the develop and this branch is to have a HashSet as the GroupBy iterator key datastructure. I noticed that the iterationCacheSyncThreshold was not read into the LmdbStore from it's config.
Regarding MapDB I am thinking of replacing it by lmdb and also to remove all the serialization logic, and replace it by something custom to a store.
@JervenBolleman A class that may be used as a blueprint can be found at: https://github.com/eclipse/rdf4j/blob/main/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TxnRecordCache.java
I pushed a commit but won't be able to show the benchmark results until tomorrow morning. I think it will give better result as it really allows avoiding the retrieval of actual value contents from the lmdb value store. The trick is to introduce a createSetOfBindingSets
method. As this is what MapDB is currently used for. The buildEntries method always uses a in memory LinkedHashSet
( as an aside @jeenbroekstra do you know why a LinkedHashSet?)
@kenwenzel I am going to try parametrizing the Foaf benchmark to use or not use the MapDB options. To see what the impact here is.
In any case the experience of coding this makes me feel that maybe the maven pom should be part of the query-evaluation layer or close enough to it.
This branch at 9233e6334f7f11b455f94fa06120f243c6fe7962
QueryBenchmarkFoaf.groupByCount 0 avgt 5 2667.477 ± 388.056 ms/op
QueryBenchmarkFoaf.groupByCount 128 avgt 5 28265.484 ± 1768.305 ms/op
QueryBenchmarkFoaf.groupByCount 1024 avgt 5 13780.471 ± 2339.466 ms/op
QueryBenchmarkFoaf.groupByCountSorted 0 avgt 5 2396.928 ± 132.828 ms/op
QueryBenchmarkFoaf.groupByCountSorted 128 avgt 5 26570.518 ± 2586.584 ms/op
QueryBenchmarkFoaf.groupByCountSorted 1024 avgt 5 12096.365 ± 2074.652 ms/op
QueryBenchmarkFoaf.personsAndFriends 0 avgt 5 254.298 ± 1.851 ms/op
QueryBenchmarkFoaf.personsAndFriends 128 avgt 5 250.539 ± 3.906 ms/op
QueryBenchmarkFoaf.personsAndFriends 1024 avgt 5 250.280 ± 1.573 ms/op
So we use MapDB with 2 sync settings and once pure in mem with setting 0.
I need to rerun this on develop and with the 2524247 as well.
This is at develop plus the two patches
0 == inmem
128 and 1024 are two sync/commit every x mapdb
QueryBenchmarkFoaf.groupByCount 0 avgt 5 3667.453 ± 77.852 ms/op
QueryBenchmarkFoaf.groupByCount 128 avgt 5 12246.899 ± 5715.439 ms/op
QueryBenchmarkFoaf.groupByCount 1024 avgt 5 11985.620 ± 1197.471 ms/op
QueryBenchmarkFoaf.groupByCountSorted 0 avgt 5 4658.632 ± 1781.128 ms/op
QueryBenchmarkFoaf.groupByCountSorted 128 avgt 5 9865.982 ± 2067.274 ms/op
QueryBenchmarkFoaf.groupByCountSorted 1024 avgt 5 12274.115 ± 1634.895 ms/op
QueryBenchmarkFoaf.personsAndFriends 0 avgt 5 252.251 ± 7.844 ms/op
QueryBenchmarkFoaf.personsAndFriends 128 avgt 5 255.243 ± 3.756 ms/op
QueryBenchmarkFoaf.personsAndFriends 1024 avgt 5 248.170 ± 2.815 ms/op
We see the 33% speed up for the in memory usecase but using mapdb we doubled our timing. That is not what we want so I need to investigate what is causing that slow down.
@kenwenzel sorry I had pasted the wrong results into the comment. Now updated.
@JervenBolleman Thank you for the detailed benchmarks. You are speeking about doubled execution times using MapDB. To what figures are you relating these doubled times?
We can at least reduce the fsync happiness of MapDb and that seems to have a large effect which is expectable. Still that should be an independent patch against mainline or c2e8b24 does not prove itself being worth it.
QueryBenchmarkFoaf.groupByCount 0 avgt 5 2692.399 ± 604.772 ms/op
QueryBenchmarkFoaf.groupByCount 128 avgt 5 12762.410 ± 1074.142 ms/op
QueryBenchmarkFoaf.groupByCount 1024 avgt 5 11399.744 ± 2752.997 ms/op
QueryBenchmarkFoaf.groupByCount 2048 avgt 5 11088.300 ± 540.409 ms/op
QueryBenchmarkFoaf.groupByCountSorted 0 avgt 5 2503.945 ± 112.311 ms/op
QueryBenchmarkFoaf.groupByCountSorted 128 avgt 5 11091.863 ± 1000.285 ms/op
QueryBenchmarkFoaf.groupByCountSorted 1024 avgt 5 9380.511 ± 554.131 ms/op
QueryBenchmarkFoaf.groupByCountSorted 2048 avgt 5 9138.366 ± 341.620 ms/op
Ok code clean up some tests. And a reason for the earlier regression. So it looks like the regression happened because the better hashcode made the db a bit bigger which affected the 128 set benchmark on my machine as it just passed my sata disk cache in size.
Anyway with the new Serialization code we end up with a smaller database and a bit better performance. We are showing the limits of what can be achieved in this benchmark due to db size. We need larger group by with more patalogical use cases to see what is really going on.
QueryBenchmarkFoaf.groupByCount 0 avgt 5 2518.320 ± 174.551 ms/op
QueryBenchmarkFoaf.groupByCount 128 avgt 5 10246.649 ± 157.781 ms/op
QueryBenchmarkFoaf.groupByCount 1024 avgt 5 8573.631 ± 499.651 ms/op
QueryBenchmarkFoaf.groupByCount 2048 avgt 5 8290.967 ± 134.211 ms/op
QueryBenchmarkFoaf.groupByCountSorted 0 avgt 5 2345.347 ± 111.988 ms/op
QueryBenchmarkFoaf.groupByCountSorted 128 avgt 5 9978.101 ± 361.823 ms/op
QueryBenchmarkFoaf.groupByCountSorted 1024 avgt 5 8361.297 ± 706.103 ms/op
QueryBenchmarkFoaf.groupByCountSorted 2048 avgt 5 8180.156 ± 403.334 ms/op
The performance improvements are decent enough with the latest code.
Benchmark (mapDbCommit) Mode Cnt Score Error Units
QueryBenchmark.complexQuery N/A avgt 5 17.767 ± 0.550 ms/op
QueryBenchmark.distinctPredicatesQuery N/A avgt 5 166.903 ± 1.198 ms/op
QueryBenchmark.groupByQuery N/A avgt 5 3.914 ± 0.193 ms/op
QueryBenchmark.pathExpressionQuery1 N/A avgt 5 58.529 ± 3.258 ms/op
QueryBenchmark.removeByQuery N/A avgt 5 286.105 ± 8.861 ms/op
QueryBenchmark.removeByQueryReadCommitted N/A avgt 5 602.246 ± 6.867 ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone N/A avgt 5 417.161 ± 2.646 ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted N/A avgt 5 837.770 ± 14.059 ms/op
QueryBenchmarkFoaf.groupByCount 0 avgt 5 2604.481 ± 74.831 ms/op
QueryBenchmarkFoaf.groupByCount 128 avgt 5 10030.501 ± 209.233 ms/op
QueryBenchmarkFoaf.groupByCount 1024 avgt 5 8503.735 ± 79.267 ms/op
QueryBenchmarkFoaf.groupByCountSorted 0 avgt 5 2433.697 ± 159.198 ms/op
QueryBenchmarkFoaf.groupByCountSorted 128 avgt 5 9833.347 ± 618.525 ms/op
QueryBenchmarkFoaf.groupByCountSorted 1024 avgt 5 8164.269 ± 53.863 ms/op
QueryBenchmarkFoaf.personsAndFriends 0 avgt 5 248.262 ± 4.540 ms/op
QueryBenchmarkFoaf.personsAndFriends 128 avgt 5 250.730 ± 3.168 ms/op
QueryBenchmarkFoaf.personsAndFriends 1024 avgt 5 247.654 ± 9.588 ms/op
I think with larger datasets the difference will be even larger.
I would at this point like somethoughts on how to best do the configuration of the Collection-Factory.
I've merged develop into this branch and also tried to optimize the code a bit based on the QueryBenchmark#groupByQuery()
in the MemoryStore. I've also opened a PR to show how we could remove the new method you've added to the Sail interface.
@JervenBolleman is this still a draft PR or are you happy with how it looks?
@hmottestad I still have a number of outstanding commits for this PR at home :) I would like to push those before having it merged into main.
There is a clean pull request https://github.com/eclipse/rdf4j/pull/4314 that I think is reviewable. Once that is merged we can start bringing in the performance increases from this PR.