rdf4j GH-3843 collection factory

GitHub issue resolved: #3843

PR Author Checklist (see the contributor guidelines for more details):

[x] my pull request is self-contained
[ ] I've added tests for the changes I made
[x] I've applied code formatting (you can use mvn process-resources to format from the command line)
[x] I've squashed my commits where necessary
[x] every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

Apr 28 '22 15:04 JervenBolleman

@kenwenzel would you mind having a look at this? See if you like the ideas.

Apr 28 '22 16:04 JervenBolleman

@kenwenzel would you mind having a look at this? See if you like the ideas.

@JervenBolleman I like the idea of making the collections configurable for different SAIL implementations. But currently, I can't oversee what collection types are required. In addition to changing the entire collection implementations, we could use a mapper interface to convert between SAIL-specific values and IDs.

Apr 29 '22 06:04 kenwenzel

Looks like github action for this pull request is hanging. I think I will close it in a bit and reopen it.

Apr 29 '22 07:04 JervenBolleman

I am now at the point of actually wiring this up into the evaluation strategy. I think I don't have a good API placement or at least I am not to happy with the exact placing of methods. Also trying to make this work for the GroupIterator is not going well, and there it would have the highest impact. I need to change this more significantly or the alternative is that Serializable infects a whole lot of code.

May 09 '22 08:05 JervenBolleman

lmdb query benchmark at 31c0933dfa989b2263c6a71ee107cf99979922f7

Benchmark                                               Mode  Cnt     Score    Error  Units
QueryBenchmark.complexQuery                             avgt    5    71.214 ± 19.824  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   230.005 ± 43.627  ms/op
QueryBenchmark.groupByQuery                             avgt    5    62.456 ±  5.115  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5   105.636 ±  3.124  ms/op
QueryBenchmark.removeByQuery                            avgt    5   293.808 ±  3.875  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   611.711 ± 26.506  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   534.920 ± 10.934  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5   941.596 ± 48.650  ms/op
QueryBenchmarkFoaf.groupByCount                         avgt    5  2516.238 ± 97.970  ms/op
QueryBenchmarkFoaf.groupByCountSorted                   avgt    5  2417.892 ± 47.437  ms/op
QueryBenchmarkFoaf.personsAndFriends                    avgt    5   303.885 ±  3.198  ms/op

compared to develop at eab706d0d46cb340cb1f63c3098e42c802620a69

Benchmark                                               Mode  Cnt     Score    Error  Units
QueryBenchmark.complexQuery                             avgt    5    19.973 ±  0.540  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   178.691 ±  5.984  ms/op
QueryBenchmark.groupByQuery                             avgt    5     3.988 ±  0.248  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5    59.144 ±  0.901  ms/op
QueryBenchmark.removeByQuery                            avgt    5   284.295 ±  2.369  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   593.844 ± 23.566  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   418.575 ±  4.357  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5   826.619 ± 44.657  ms/op
QueryBenchmarkFoaf.groupByCount                         avgt    5  3606.748 ± 49.783  ms/op
QueryBenchmarkFoaf.groupByCountSorted                   avgt    5  3471.768 ± 32.543  ms/op
QueryBenchmarkFoaf.personsAndFriends                    avgt    5   246.236 ±  4.751  ms/op

Big regressions on some, with big gains on others. I suspect the loss is due to always creating a mapdb database per query execution. This should be at least lazy, also more mapdb instances might be running.

This depends in part on the quality of the hashcode for the lmdb values. The first version was not great, the current one is better but still not great.

To make the equals test use the functions they need to be serializable because they need to be stored in mapdb. This is not nice, and I think that part of the code should be reverted.

May 09 '22 15:05 JervenBolleman

@JervenBolleman Thank you for investigating this. Perhaps real improvements will be visible if the data sets get larger and don't fit into the available RAM. Most of our benchmarks more or less just test the in-memory case.

May 10 '22 06:05 kenwenzel

@kenwenzel I think it shows there is a big potential gain. On the other hand this code is not nearly good enough to be merged. On my home machine I have a benchmark run with great results, but that is with the enabling of the lazy mapdb mode. On the other hand the sync setting is not passed on so we always use an in memory hashmap. Quite a lot to still to be fixed. Especially around the setup of these collection factories.

May 10 '22 09:05 JervenBolleman

at 847dbe415a8b4328f1df77e51389dd7b5e98bb84

QueryBenchmark.complexQuery                             avgt    5    17.374 ±  0.219  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   167.387 ±  3.132  ms/op
QueryBenchmark.groupByQuery                             avgt    5     3.639 ±  0.135  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5    59.854 ±  1.249  ms/op
QueryBenchmark.removeByQuery                            avgt    5   284.124 ±  3.513  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   616.349 ±  8.395  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   418.440 ±  5.842  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5   838.561 ± 23.125  ms/op
QueryBenchmarkFoaf.groupByCount                         avgt    5  2571.679 ± 58.877  ms/op
QueryBenchmarkFoaf.groupByCountSorted                   avgt    5  2347.194 ± 40.125  ms/op
QueryBenchmarkFoaf.personsAndFriends                    avgt    5   249.253 ±  1.353  ms/op

I need to verify here that the mapdb worked as it was supposed too.

May 10 '22 19:05 JervenBolleman

Cool! That looks really promising. Have you also thought about caching of intermediate results (with MapDb)? This seems to be one of the reasons why QLever answers some queries fast.

May 10 '22 20:05 kenwenzel

Ok @kenwenzel this is a real improvement. Here the config on both the develop and this branch is to have a HashSet as the GroupBy iterator key datastructure. I noticed that the iterationCacheSyncThreshold was not read into the LmdbStore from it's config.

Regarding MapDB I am thinking of replacing it by lmdb and also to remove all the serialization logic, and replace it by something custom to a store.

May 10 '22 20:05 JervenBolleman

@JervenBolleman A class that may be used as a blueprint can be found at: https://github.com/eclipse/rdf4j/blob/main/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TxnRecordCache.java

May 11 '22 07:05 kenwenzel

I pushed a commit but won't be able to show the benchmark results until tomorrow morning. I think it will give better result as it really allows avoiding the retrieval of actual value contents from the lmdb value store. The trick is to introduce a createSetOfBindingSets method. As this is what MapDB is currently used for. The buildEntries method always uses a in memory LinkedHashSet ( as an aside @jeenbroekstra do you know why a LinkedHashSet?)

@kenwenzel I am going to try parametrizing the Foaf benchmark to use or not use the MapDB options. To see what the impact here is.

In any case the experience of coding this makes me feel that maybe the maven pom should be part of the query-evaluation layer or close enough to it.

May 11 '22 14:05 JervenBolleman

This branch at 9233e6334f7f11b455f94fa06120f243c6fe7962

QueryBenchmarkFoaf.groupByCount                                     0  avgt    5   2667.477 ±  388.056  ms/op
QueryBenchmarkFoaf.groupByCount                                   128  avgt    5  28265.484 ± 1768.305  ms/op
QueryBenchmarkFoaf.groupByCount                                  1024  avgt    5  13780.471 ± 2339.466  ms/op
QueryBenchmarkFoaf.groupByCountSorted                               0  avgt    5   2396.928 ±  132.828  ms/op
QueryBenchmarkFoaf.groupByCountSorted                             128  avgt    5  26570.518 ± 2586.584  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            1024  avgt    5  12096.365 ± 2074.652  ms/op
QueryBenchmarkFoaf.personsAndFriends                                0  avgt    5    254.298 ±    1.851  ms/op
QueryBenchmarkFoaf.personsAndFriends                              128  avgt    5    250.539 ±    3.906  ms/op
QueryBenchmarkFoaf.personsAndFriends                             1024  avgt    5    250.280 ±    1.573  ms/op

So we use MapDB with 2 sync settings and once pure in mem with setting 0.

I need to rerun this on develop and with the 2524247 as well.

May 12 '22 06:05 JervenBolleman

This is at develop plus the two patches

                                                                    0 == inmem
                                                                    128 and 1024 are two sync/commit every x mapdb
QueryBenchmarkFoaf.groupByCount                                     0  avgt    5   3667.453 ±   77.852  ms/op
QueryBenchmarkFoaf.groupByCount                                   128  avgt    5  12246.899 ± 5715.439  ms/op
QueryBenchmarkFoaf.groupByCount                                  1024  avgt    5  11985.620 ± 1197.471  ms/op
QueryBenchmarkFoaf.groupByCountSorted                               0  avgt    5   4658.632 ± 1781.128  ms/op
QueryBenchmarkFoaf.groupByCountSorted                             128  avgt    5   9865.982 ± 2067.274  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            1024  avgt    5  12274.115 ± 1634.895  ms/op
QueryBenchmarkFoaf.personsAndFriends                                0  avgt    5    252.251 ±    7.844  ms/op
QueryBenchmarkFoaf.personsAndFriends                              128  avgt    5    255.243 ±    3.756  ms/op
QueryBenchmarkFoaf.personsAndFriends                             1024  avgt    5    248.170 ±    2.815  ms/op

We see the 33% speed up for the in memory usecase but using mapdb we doubled our timing. That is not what we want so I need to investigate what is causing that slow down.

@kenwenzel sorry I had pasted the wrong results into the comment. Now updated.

May 12 '22 17:05 JervenBolleman

@JervenBolleman Thank you for the detailed benchmarks. You are speeking about doubled execution times using MapDB. To what figures are you relating these doubled times?

May 12 '22 17:05 kenwenzel

We can at least reduce the fsync happiness of MapDb and that seems to have a large effect which is expectable. Still that should be an independent patch against mainline or c2e8b24 does not prove itself being worth it.

QueryBenchmarkFoaf.groupByCount                                     0  avgt    5   2692.399 ±  604.772  ms/op
QueryBenchmarkFoaf.groupByCount                                   128  avgt    5  12762.410 ± 1074.142  ms/op
QueryBenchmarkFoaf.groupByCount                                  1024  avgt    5  11399.744 ± 2752.997  ms/op
QueryBenchmarkFoaf.groupByCount                                  2048  avgt    5  11088.300 ±  540.409  ms/op
QueryBenchmarkFoaf.groupByCountSorted                               0  avgt    5   2503.945 ±  112.311  ms/op
QueryBenchmarkFoaf.groupByCountSorted                             128  avgt    5  11091.863 ± 1000.285  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            1024  avgt    5   9380.511 ±  554.131  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            2048  avgt    5   9138.366 ±  341.620  ms/op

May 14 '22 18:05 JervenBolleman

Ok code clean up some tests. And a reason for the earlier regression. So it looks like the regression happened because the better hashcode made the db a bit bigger which affected the 128 set benchmark on my machine as it just passed my sata disk cache in size.

Anyway with the new Serialization code we end up with a smaller database and a bit better performance. We are showing the limits of what can be achieved in this benchmark due to db size. We need larger group by with more patalogical use cases to see what is really going on.

QueryBenchmarkFoaf.groupByCount                                     0  avgt    5   2518.320 ± 174.551  ms/op
QueryBenchmarkFoaf.groupByCount                                   128  avgt    5  10246.649 ± 157.781  ms/op
QueryBenchmarkFoaf.groupByCount                                  1024  avgt    5   8573.631 ± 499.651  ms/op
QueryBenchmarkFoaf.groupByCount                                  2048  avgt    5   8290.967 ± 134.211  ms/op
QueryBenchmarkFoaf.groupByCountSorted                               0  avgt    5   2345.347 ± 111.988  ms/op
QueryBenchmarkFoaf.groupByCountSorted                             128  avgt    5   9978.101 ± 361.823  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            1024  avgt    5   8361.297 ± 706.103  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            2048  avgt    5   8180.156 ± 403.334  ms/op

May 23 '22 20:05 JervenBolleman

The performance improvements are decent enough with the latest code.

Benchmark                                               (mapDbCommit)  Mode  Cnt      Score     Error  Units
QueryBenchmark.complexQuery                                       N/A  avgt    5     17.767 ±   0.550  ms/op
QueryBenchmark.distinctPredicatesQuery                            N/A  avgt    5    166.903 ±   1.198  ms/op
QueryBenchmark.groupByQuery                                       N/A  avgt    5      3.914 ±   0.193  ms/op
QueryBenchmark.pathExpressionQuery1                               N/A  avgt    5     58.529 ±   3.258  ms/op
QueryBenchmark.removeByQuery                                      N/A  avgt    5    286.105 ±   8.861  ms/op
QueryBenchmark.removeByQueryReadCommitted                         N/A  avgt    5    602.246 ±   6.867  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone                     N/A  avgt    5    417.161 ±   2.646  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted            N/A  avgt    5    837.770 ±  14.059  ms/op
QueryBenchmarkFoaf.groupByCount                                     0  avgt    5   2604.481 ±  74.831  ms/op
QueryBenchmarkFoaf.groupByCount                                   128  avgt    5  10030.501 ± 209.233  ms/op
QueryBenchmarkFoaf.groupByCount                                  1024  avgt    5   8503.735 ±  79.267  ms/op
QueryBenchmarkFoaf.groupByCountSorted                               0  avgt    5   2433.697 ± 159.198  ms/op
QueryBenchmarkFoaf.groupByCountSorted                             128  avgt    5   9833.347 ± 618.525  ms/op
QueryBenchmarkFoaf.groupByCountSorted                            1024  avgt    5   8164.269 ±  53.863  ms/op
QueryBenchmarkFoaf.personsAndFriends                                0  avgt    5    248.262 ±   4.540  ms/op
QueryBenchmarkFoaf.personsAndFriends                              128  avgt    5    250.730 ±   3.168  ms/op
QueryBenchmarkFoaf.personsAndFriends                             1024  avgt    5    247.654 ±   9.588  ms/op

I think with larger datasets the difference will be even larger.

I would at this point like somethoughts on how to best do the configuration of the Collection-Factory.

Jun 03 '22 11:06 JervenBolleman

I've merged develop into this branch and also tried to optimize the code a bit based on the QueryBenchmark#groupByQuery() in the MemoryStore. I've also opened a PR to show how we could remove the new method you've added to the Sail interface.

Jun 27 '22 13:06 hmottestad

@JervenBolleman is this still a draft PR or are you happy with how it looks?

Jul 16 '22 09:07 hmottestad

@hmottestad I still have a number of outstanding commits for this PR at home :) I would like to push those before having it merged into main.

Aug 12 '22 16:08 JervenBolleman

There is a clean pull request https://github.com/eclipse/rdf4j/pull/4314 that I think is reviewable. Once that is merged we can start bringing in the performance increases from this PR.

Dec 13 '22 13:12 JervenBolleman

rdf4j rdf4j copied to clipboard

GH-3843 collection factory

rdf4j
rdf4j copied to clipboard