kitodo-production icon indicating copy to clipboard operation
kitodo-production copied to clipboard

[hibernate-search] Implement indexing and search

Open matthias-ronge opened this issue 1 year ago • 3 comments

Issue #5760 2c), 2d), 2e) and 2f)

Follow-up pull request to #6218 (immediate diff)

The filters work. Metadata is indexed and searchable. The metadata search syntax works as documented in the wiki.

matthias-ronge avatar Oct 29 '24 16:10 matthias-ronge

Great work! Can you elaborate (short high level overview) how the metadata is indexed now in general? (For reference: https://github.com/kitodo/kitodo-production/issues/4266)

BartChris avatar Oct 29 '24 18:10 BartChris

Can you elaborate (short high level overview) how the metadata is indexed now in general?

The metadata is stored in a text field that is indexed. The search is later carried out on the text field.

"Pseudo words" are used to search on certain metadata fields, i.e. if the word Berlin is in the TitleDocMain field in the metadata, a pseudo word "titledocmainqberlin" is also indexed. If a user later enters "TitleDocMain:Berlin" in the search field, this is converted into the pseudo word and searched for.

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

matthias-ronge avatar Oct 30 '24 09:10 matthias-ronge

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

Thanks a lot, this is helpful! Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

<key id="docType" use="docType">
           <label>Document Type</label>
           <label lang="de">Dokumenttyp</label>
</key>

And those combined terms (LABEL+VALUE) are all stored in the index? What happens if somebody changes the german label of a metadata key in the ruleset?

BartChris avatar Oct 30 '24 16:10 BartChris

Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

Yes.

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

matthias-ronge avatar Oct 31 '24 09:10 matthias-ronge

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

I understand. I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label. I do not want to increase your workload too much, but would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

BartChris avatar Oct 31 '24 10:10 BartChris

Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

A more general remark here. I have not really tested anything so far, but it appears to me that almost everything gets indexed. My assumption would be that functionality like displaying list of processes (with all the necessary icons, which require lookups) is faster from the index since we can store data there in denormalized form. It is nice, that we can run Kitodo without a running index for certain functionality but my intuition would be to always use the index if that gives us an edge over database-based retrieval. Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

This is not necessarily adressed at the current PR, but more a general remark which can be considered.

BartChris avatar Oct 31 '24 11:10 BartChris

I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label.

In my experience, rulesets are rarely changed at all during a running project, except perhaps the addition of a field; and I know librarians are very good at not making typos.

would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

  1. The metadata is also indexed under the key string defined in the rule set, so you can always use the key name for the search.
  2. Yes, of course, that would be possible, but I think it goes against the purpose of using search engines. It would be possible to implement this, but it would significantly slow down search queries because all relevant rulesets would have to be determined each time and checked to see whether a field name needs to be translated. I don't think that would be useful.

Perhaps it would be interesting to discuss this case among users. Such a change can be introduced later.

matthias-ronge avatar Oct 31 '24 13:10 matthias-ronge

it appears to me that almost everything gets indexed.

Yes, a lot of indexing is still going on at the moment and I don't think that's necessary. But the task here was not to make more than necessary changes, but to keep the application as most as it was before. But I think that in a separate development, it will be possible to clean this up so that in the end only processes and tasks are indexed, as only these contain metadata. The other objects are never searched, so they don’t do something sensible in the index by now.

My assumption would be that functionality like displaying list of processes […] is faster from the index since we can store data there in denormalized form. […] Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

No, it's exactly the other way round: In Production 3.7, the display is taken from the index alone, as you describe here as a wish. And it is slow. In this version, this behavior was exactly inverted and now everything except the keyword search is taken from the database. On my developer laptop, the application has become significantly faster as a result.

Don't ask me why it behaves like this. This is another example of how performance is less about well-intentioned programming than about actual measurements. I suspect that both database queries and restoring Java in-memory objects from database rows are so much more sophisticated, having been researched for several decades longer, that it beats out indexing.

matthias-ronge avatar Oct 31 '24 14:10 matthias-ronge

~~I get an exception when navigating to the processes list (after rebuilding the index via the updated "indexing" page):~~

Edit: I tested on an outdated version of this branch. With the latest commits the exception on the process list page disappears.

Instead, I now get an exception when trying to edit a project, though:

Bildschirmfoto 2024-12-02 um 13 20 43

solth avatar Dec 02 '24 11:12 solth

All the filters i enter are always duplicated in the filter menu, which also duplicates the conditions in the database query. image

The SQL query generated has two IN-conditions with the same process Ids.

BartChris avatar Dec 12 '24 19:12 BartChris

Requirement checklist; as discussed in the Community Board:

  1. Blue search box in the top left jumps to the normal process list and not a special list with less selectable actions

Works in general. Some problems

a) Searching in the top left box does not reset the process list filter, but appends. A search should reset the current search.

image

  1. Searching for processes

a) Searching for process:11 throws an exception. It seems to happen when i just type process:11 and press enter, without selecting process from the dropdown.

The same thing sometimes happens when searching for 11 in the top left search bar

image

After that it is not really possible to return to the process list without logging out.

b) The requirements outlined above seem to be working

  • character string separation at non-word boundaries, where a word consists of letters and numbers
  • every separated group can be searched using right truncation, only groups with at least three characters need to be indexed AND
  • every separated group can be searched using left truncation, only groups with at least three characters need to be indexed
  • but not both in combination

BartChris avatar Dec 17 '24 17:12 BartChris

~I get an exception when navigating to the processes list (after rebuilding the index via the updated "indexing" page):~

Edit: I tested on an outdated version of this branch. With the latest commits the exception on the process list page disappears.

Instead, I now get an exception when trying to edit a project, though:

Bildschirmfoto 2024-12-02 um 13 20 43

I get the same exception when trying to add a new process through the project list.

BartChris avatar Dec 17 '24 17:12 BartChris

When trying to create a process i still get

org.hibernate.LazyInitializationException: could not initialize proxy [org.kitodo.data.database.beans.ImportConfiguration#3] - no Session
	at org.hibernate.proxy.AbstractLazyInitializer.initialize(AbstractLazyInitializer.java:176)
	at org.hibernate.proxy.AbstractLazyInitializer.getImplementation(AbstractLazyInitializer.java:322)
	at org.hibernate.proxy.pojo.bytebuddy.ByteBuddyInterceptor.intercept(ByteBuddyInterceptor.java:45)
	at org.hibernate.proxy.ProxyConfiguration$InterceptorDispatcher.intercept(ProxyConfiguration.java:95)
	at org.kitodo.data.database.beans.ImportConfiguration$HibernateProxy$fWH8cidi.getConfigurationType(Unknown Source)
	at org.kitodo.production.forms.createprocess.CreateProcessForm.setDefaultImportConfiguration(CreateProcessForm.java:476)
	at org.kitodo.production.forms.createprocess.CreateProcessForm.prepareProcess(CreateProcessForm.java:431)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Do you consider the PR to be in draft state?

BartChris avatar Feb 11 '25 17:02 BartChris

The basic functionality should work. I was able to start the branch myself. Maybe not everything works, there can always be bugs that I need to fix. For a review (proofreading the source code) I think it's already good enough for you to look at it and many features can work and most of it works. For now I managed to fix all the integration tests. And the Selenium tests worked for me today, too. So far it's going well. After @solth it's just the final approval on the merge to the "main" branch, the conflicting commits still need to be merged and fixed. This can be done separately afterwards. I always check out the branch locally and let it run as far as possible.

matthias-ronge avatar Feb 12 '25 14:02 matthias-ronge

Regarding the errors mentioned by @BartChris:

  1. Yes, I hadn't considered that case and have just fixed it. (The filters that have already been converted to blue boxes are internally stored in a separate list that needs to be cleared. It wasn't enough to overwrite the filter slot. I wasn't aware of that.)

2a) I can't reproduce the error and I can't open the screenshot in its own window. I don't understand how the query … AND process.null = :userFilter1 OR process.null = :userFilter2 ) AND … got to this state. A search for 11 works for me.

2b) Does that mean it works as it should?

matthias-ronge avatar Feb 12 '25 15:02 matthias-ronge

Hi @solth, is that OK so far? Can you merge the PR? Then I could start merging up to main. I would like to finish the project. Or do you see any need for changes here? I don't like rebasing a branch with lots of merges later, this is why I ask. This is not an exclusion to fix any issues that arise before the final merge, just that I would then fix them on the updated branch.

matthias-ronge avatar Feb 19 '25 08:02 matthias-ronge

@matthias-ronge Even though there are still some issues that need to be addressed (for example potential performance problems) I think those can be discussed and resolved in the final pull request from the hibernate-search branch against the main branch. Since we need to get this project finished I will merge this pull request so that you can work on updating the hibernate-search branch with the changes introduced into the main branch.

Important: this does not mean that all review remarks from this pull request can be discarded. I want to be clear that it will still be legitimate to request conceptional changes in the final pull request, for example regarding the retrieval of many objects via huge SQL queries with thousands of IDs. I am not convinced that we cannot find a better way in that regard and expect we will still need changes at some other points as well.

@henning-gerhardt, @BartChris, @thomaslow, @oliver-stoehr please apply any further review remarks to the upcoming pull request from the hibernate-search branch against the main branch once @matthias-ronge deems it ready for review!

solth avatar Feb 19 '25 08:02 solth