[hibernate-search] Introduce Hibernate Search framework and implement indexing page

Open matthias-ronge opened this issue 1 year ago • 6 comments

Issue #5760 2a) and 2b)

Follow-up pull request to #6209 (immediate diff)

Recording

The three numbers before the slash in “Indexed entries” represent the number of objects that Hibernate has already loaded from the database, the number of objects that have been prepared as indexable documents (JSONs), and finally the number of indexed documents.

Basic experience: Hibernate Search and lazy loading don't mix. It looks like we have to accept that. As a result, I have deactivated lazy loading wherever the number of members of a set is typically small (< 25). This affects most sets, e.g. projects of a template, tasks, users or properties of a template or a process, etc. If the set can typically be large (> 1000), the elements of the set are not indexed. Example: processes of a batch. Consideration: if the number of sub-elements indexed with an object is very large, the object stops being a selective search result, because it becomes increasingly likely that it will be found by any search query. Such indexing also makes the index enormously large. It therefore seems justifiable not to index these fields.
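
The consideration about very large sets can be made concrete with a small calculation. This is only a sketch under a simplifying assumption that is not part of the pull request: each indexed sub-element matches a given query independently with the same probability p. The class name and the value of p are hypothetical.

```java
final class FindabilitySketch {

    /**
     * Probability that an object is found through at least one of its n
     * indexed sub-elements, assuming each sub-element matches the query
     * independently with probability p (a simplification).
     */
    static double matchProbability(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.001; // hypothetical per-element match probability
        for (int n : new int[] {25, 1000, 10000}) {
            System.out.printf("n=%d -> %.4f%n", n, matchProbability(p, n));
        }
    }
}
```

Under this assumption, a set of 25 elements barely changes how often the parent object is found, while a set of thousands of elements makes it match almost any query.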

matthias-ronge avatar Sep 04 '24 14:09 matthias-ronge

@matthias-ronge: a hopefully short general question: is it possible to use different indices with Hibernate Search? Currently this is possible through different values of the elasticsearch.index configuration. Is this or something similar still possible? I'm asking because I'm working with different Kitodo.Production versions which have separate metadata directories on my local file system, different databases in a MariaDB server and different search prefixes in an Elasticsearch instance. This does not have to work in the current state of the changes, nor is it a current goal, but maybe it is something for later?

henning-gerhardt avatar Sep 09 '24 11:09 henning-gerhardt

is it possible to use different indices with Hibernate-Search?

The index names for the individual objects are contained in the annotations as a string. I cannot estimate whether it is even possible to use variables here, or whether these have to be hard-coded strings at compile time; but I suspect the latter. Index access is controlled via properties such as port. You could install several index services on different ports and set the port at runtime before the program starts, or change the index data directory (as a symbolic link).

Such a feature is currently not in the scope of our development.

matthias-ronge avatar Sep 09 '24 13:09 matthias-ronge

Thank you @matthias-ronge for the explanation. I know, and I did not expect this usage scenario (using different Hibernate Search indices) to be part of the current development.

Edit: Maybe indexlayout-strategy-custom is a way to achieve this. But this is nothing for now.
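
To illustrate what a prefix-based layout could compute, here is a minimal sketch in plain Java. It deliberately does not implement the Hibernate Search layout-strategy interface; it only shows the naming logic. The class name, the method names and the prefix "kitodo_dev" are hypothetical, and the suffixes (-000001, -write, -read) are assumed to follow the naming convention of Hibernate Search's default layout.

```java
final class PrefixedIndexNames {

    private final String prefix;

    /** The prefix is a hypothetical per-installation value, e.g. "kitodo_dev". */
    PrefixedIndexNames(String prefix) {
        this.prefix = prefix;
    }

    /** Name of the Elasticsearch index created for a Hibernate Search index name. */
    String initialIndexName(String hibernateSearchIndexName) {
        return prefix + "-" + hibernateSearchIndexName + "-000001";
    }

    /** Alias used for write operations. */
    String writeAlias(String hibernateSearchIndexName) {
        return prefix + "-" + hibernateSearchIndexName + "-write";
    }

    /** Alias used for read operations. */
    String readAlias(String hibernateSearchIndexName) {
        return prefix + "-" + hibernateSearchIndexName + "-read";
    }

    public static void main(String[] args) {
        PrefixedIndexNames names = new PrefixedIndexNames("kitodo_dev");
        System.out.println(names.initialIndexName("process"));
        System.out.println(names.writeAlias("process"));
        System.out.println(names.readAlias("process"));
    }
}
```

With one prefix per installation, several Kitodo.Production versions could share a single Elasticsearch instance without their indices colliding, similar to what the elasticsearch.index setting provides today.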

henning-gerhardt avatar Sep 09 '24 13:09 henning-gerhardt

@matthias-ronge I checked out your branch, and took notes of my testing experience.

  1. I built a new war file, and deployed it to Tomcat. At first, there was an error message in the log stating that hibernate-search could not connect to my elastic search instance (which is fine, because I do not run it on localhost).

Unable to detect the Elasticsearch version running on the cluster: HSEARCH400007: Elasticsearch request failed: Connection refused

  2. Then, I copied the hibernate.properties into my config-local directory, and changed the host name. Now the application starts without any error messages.

  3. I tried to log in to kitodo-production with my admin account. Nothing happens. The page keeps loading forever. No errors, but lots of CPU activity for mariadb and tomcat. Maybe indexes are being created in the background? But there is no user interface or message.

  4. After ~15 minutes, the kitodo dashboard is shown, but my CPU is still active. The System - Indexing page does not show any progress. Only 0% everywhere. After ~30 minutes without any page loading, the CPU load is normal again. Maybe disabling lazy-loading triggers thousands of database queries when loading the dashboard? My test database contains ~80.000 processes.

Unfortunately, in this state, it is not possible to do further testing.

@matthias-ronge In case you have not done this yet, please test your branch with a large amount of test data. Otherwise, let me know, and I will try to figure out why pages are loading so slowly on my machine.

thomaslow avatar Oct 21 '24 10:10 thomaslow

I tried to start the indexing. Some entities were indexed within a few seconds. The remaining entities (processes, projects, tasks, templates) stay at 0% for at least the last 5 minutes.

image

After ~10 minutes all entities except processes and tasks were indexed at 100%. Processes and tasks have only 60 indexed entities (of 80.000 and 4.000 respectively).

thomaslow avatar Oct 21 '24 10:10 thomaslow

Thank you for the testing and your insights. However, this is not what I expected. I have not yet tested with such a large amount of data, so I will have to look into it myself first. My general assumption is that the framework works reasonably well and that this could be due to some small detail. If I can confirm that it works for large data, I will let you know.

The code does not manually create an index at startup. I also saw a delay at first, but only of a few seconds. It is clear that I have to check this.

matthias-ronge avatar Oct 28 '24 12:10 matthias-ronge

4. Maybe disabling lazy-loading triggers thousands of database queries when loading the dashboard? My test database contains ~80.000 processes.

I checked out the branch and logged the SQL statements: just scrolling through the list of processes (10 per page) floods my database with queries. I have around 1000 processes in my database. It takes a very long time to jump to the next 10 entries.

Hundreds of requests are made for one page:

2024-11-04T09:01:04.540144Z	   23 Query	rollback
2024-11-04T09:01:04.540190Z	   23 Query	SET autocommit=1
2024-11-04T09:01:04.540239Z	   22 Query	SET autocommit=0
2024-11-04T09:01:04.540308Z	   22 Query	select batches0_.process_id as process_2_2_0_, batches0_.batch_id as batch_id1_2_0_, batch1_.id as id1_1_1_, batch1_.title as title2_1_1_, batch1_.type as type3_1_1_ from batch_x_process batches0_ inner join batch batch1_ on batches0_.batch_id=batch1_.id where batches0_.process_id=2310
2024-11-04T09:01:04.540415Z	   22 Query	rollback
2024-11-04T09:01:04.540450Z	   22 Query	SET autocommit=1
2024-11-04T09:01:04.540493Z	   23 Query	SET autocommit=0
2024-11-04T09:01:04.540573Z	   23 Query	select workpieces0_.process_id as process_1_36_0_, workpieces0_.property_id as property2_36_0_, property1_.id as id1_22_1_, property1_.choice as choice2_22_1_, property1_.creationDate as creation3_22_1_, property1_.dataType as datatype4_22_1_, property1_.obligatory as obligato5_22_1_, property1_.title as title6_22_1_, property1_.value as value7_22_1_ from workpiece_x_property workpieces0_ inner join property property1_ on workpieces0_.property_id=property1_.id where workpieces0_.process_id=2309
2024-11-04T09:01:04.540918Z	   23 Query	rollback
2024-11-04T09:01:04.540958Z	   23 Query	SET autocommit=1
2024-11-04T09:01:04.541003Z	   22 Query	SET autocommit=0
2024-11-04T09:01:04.541085Z	   22 Query	select templates0_.process_id as process_1_30_0_, templates0_.property_id as property2_30_0_, property1_.id as id1_22_1_, property1_.choice as choice2_22_1_, property1_.creationDate as creation3_22_1_, property1_.dataType as datatype4_22_1_, property1_.obligatory as obligato5_22_1_, property1_.title as title6_22_1_, property1_.value as value7_22_1_ from template_x_property templates0_ inner join property property1_ on templates0_.property_id=property1_.id where templates0_.process_id=2309
2024-11-04T09:01:04.541188Z	   22 Query	rollback
2024-11-04T09:01:04.541221Z	   22 Query	SET autocommit=1
2024-11-04T09:01:04.541262Z	   23 Query	SET autocommit=0

From time to time (while many smaller queries are issued as well), really complex queries are fired.

image
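
A rough way to see how "hundreds of requests" per page arise is to count statements. In the log above, every collection select is wrapped in SET autocommit=0, rollback and SET autocommit=1, so each select costs four statements. The sketch below assumes that the page query is wrapped in the same way, that every row loads each of its eager collections with a separate select, and that nested eager collections are ignored. The class name and the number of eager collections per process are hypothetical; the log shows at least three (batches, workpiece properties, template properties).

```java
final class QueryCountSketch {

    /** Statements per select as seen in the log: SET autocommit=0, select, rollback, SET autocommit=1. */
    static final int STATEMENTS_PER_SELECT = 4;

    /** One select for the page itself plus one select per row and eager collection. */
    static int selects(int rowsPerPage, int eagerCollectionsPerRow) {
        return 1 + rowsPerPage * eagerCollectionsPerRow;
    }

    /** Total statements sent to the database for the given number of selects. */
    static int statements(int selects) {
        return selects * STATEMENTS_PER_SELECT;
    }

    public static void main(String[] args) {
        int rowsPerPage = 10; // page size mentioned in the comment above
        for (int collections : new int[] {3, 8}) { // hypothetical counts
            int s = selects(rowsPerPage, collections);
            System.out.println(collections + " collections: " + s
                    + " selects, " + statements(s) + " statements");
        }
    }
}
```

Even with only three eager collections, a single page of 10 processes already produces more than a hundred statements under these assumptions, and every additional or nested eager collection multiplies that further.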

BartChris avatar Nov 04 '24 09:11 BartChris

I can reproduce the error: with a larger database (8000 processes), it doesn't start for me either; or rather, it is still taking a while and I am just waiting. I don't know why that is; it must be coming from the framework, because the code being executed is not anything I programmed. I don't think it's good that it takes so long to start.

matthias-ronge avatar Nov 12 '24 15:11 matthias-ronge

Since we need to make progress on this front, and because the changes in this pull request are a subset of the changes in #6283, I would like to merge this pull request if no one has any larger, conceptual concerns about the proposed changes in general. Of course the performance problems need to be resolved, but in my opinion this can happen when the final pull request is opened against the master branch (this pull request is only made against the hibernate-search branch).

@thomaslow, @BartChris & @henning-gerhardt, would you agree?

solth avatar Nov 26 '24 09:11 solth

I agree with you, @solth. The hibernate-search branch needs polishing before it is merged into the master branch.

henning-gerhardt avatar Nov 26 '24 10:11 henning-gerhardt