jabref Lucene search backend

First draft of the switch to the Lucene Search Backend

Whats working

Added all bib-fields to the index
BibEntry hash is used as an identifier in the index
Both fulltext and bib-fields are indexed together, in the same index, in the background. BibEntry-fields changes are prioritized as they are very quick.
Can see in the code that searches in bib-fields are working as expected
Switch current search to use the lucene backend
Sort search-results by lucene score. This means search will not filter the table anymore, but merely sort according to results
Global search (in contrast to main table, the global search window filters for matches with score > 0)
Change the file button-icon depending on whether there was a fulltext-search match for this entry
Search-Groups
Floating search
Switch to force sort by score or not

Todo

[ ] Remove old search
[x] Display x.y at score only (two decimal places)
- [ ] The cell itself should be colored. From green to red (all in light shade): 10=green, 0= red
[x] Decide whether to display score column. Currently, it shows when a search is active. Users probably cannot do anything with the values, though. Maybe use it to shade entries with high relevance?
- --> We keep the score column
[x] Use gray background for score==0. See https://github.com/JabRef/jabref/issues/4237#issuecomment-993325334.
- [ ] Scroll to top until hits are shown (scroll down 1 shows a gray line)

Follow-ups

Improve fulltext-search-results display
- A little improvement was already done. The file icon in the table is now different (shows a magnifying glass) when there were search results in a linked file.
- Could be improved even more by inserting a fulltext-results row below the actual row.
Improve query-entry (like #8356?)

Problems:

[x] The score-column is always used to sort the table as highest-priority. JabRef will only sort by two columns max. This means with the search-score being one of them, users can only sort by one column.
- Solution: add sort-columns with shift-click
- Solution: Sorting by score column is not forced but optional
[x] Everything is indexed with the JabRef field names, which are in English. This means, either we translate the fields when creating an index, or people need to use the english terms for search queries. Translating the fields in english would mean a re-index needs to be triggered whenever the language changes and JabRef would need to notice if the language changed while it was closed.
- Solution: leave as-is, search-syntax is English and was English before the migration to lucene
[ ] Search group migration.
- Lucene syntax is different for normal searches (: instead of = to search in a field)
- Regex should still work
- No more pseudo-fields
- [ ] maybe implement #9072 first, to ease the transition
- Should users deal with this themselves?

Fixes #8857 Peek 2022-08-17 20-01

[ ] Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
[ ] Tests created for changes (if applicable)
[ ] Manually tested changed features in running JabRef (always required)
[ ] Screenshots added in PR description (for UI changes)
[ ] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
[ ] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Jul 09 '22 16:07 btut

Very nice to see this implemented! A few comments from my side:

I wouldn't bother with the migration of search groups. There is going to be friction anyway if users use advanced search functionality. Maybe show a info message if the Bib file contains search groups saying that those groups might not work as expected and need to be migrated manually?
What about showing the search score as a graphic similar to the battery or signal icon on the phone?
I have not yet tested this version, but I am a bit sceptical about implementing the search via sorting instead of filtering. I suspect that the search score will differ slightly between similar entries and thus the secondary sorting by the user will have no impact. A particular use case: Find all works by a certain author and sort them according to the read status column.
I'm definitely a big fan of adding the search score as a column, but would leave it to the user to sort the results according to the score.

Jul 22 '22 08:07 tobiasdiez

Very nice to see this implemented! A few comments from my side:

I wouldn't bother with the migration of search groups. There is going to be friction anyway if users use advanced search functionality. Maybe show a info message if the Bib file contains search groups saying that those groups might not work as expected and need to be migrated manually?

Would definitely reduce coding effort and migration code.

What about showing the search score as a graphic similar to the battery or signal icon on the phone?

Thought about that too, maybe with the option for power-users to show floats instead. I am very bad at UI design though, did not get good feedback whenever I implemented something, so would need some help with that.

I have not yet tested this version, but I am a bit sceptical about implementing the search via sorting instead of filtering. I suspect that the search score will differ slightly between similar entries and thus the secondary sorting by the user will have no impact. A particular use case: Find all works by a certain author and sort them according to the read status column.

Thats true, but filtering kind of defeats the purpose of a fulltext/fuzzy search IMO. Where do you draw the line for the filtering predicate? Score==0? In large libraries that will result in a table with lots of "bad" search results with scores slightly above 0 and people could need to scroll to find much better matches.

I'm definitely a big fan of adding the search score as a column, but would leave it to the user to sort the results according to the score.

In any case, we would need to have a very nice way to display the score if we do not sort on it, so users can still tell better fits apart.

Jul 22 '22 18:07 btut

Discussion note:

Screenshot with rows highlighted: https://martin-thoma.com/images/2014/06/jabref.png
There once was the marking feature which highlighed complete rows (this was removed, because we favour groups)

Aug 15 '22 19:08 koppor

Hey @AEgit, this PR brings back the 'floating' mode for search-results and groups. From other issues I know you are quite fond of that feature and you are even sticking to older JabRef version to preserve it? I would therefore really appreciate your feedback on how exactly bring the feature back.

This is how JabRef look as of this PR. You have a filter-toggle both for the search and for groups. By default, these toggles are enabled, so the UI reflects the current version of JabRef. If you disable the toggles, entries that are not part of the selected group(s) and/or entries that do not match the search query are NOT filtered out any more. In the screenshot above both toggles are disabled so you see all entries. Entries that do not match the search query have low opacity, entries that are not in the selected group have gray background.

Would this satisfy your needs?

I like the UI when only one of the toggles is disabled. Then the display is quite clear. However, I am quite unhappy with how the UI looks when both are disabled (as in the screenshot). Maybe you have a better idea that would work best for your use-case? General feedback on the usability is also highly appreciated.

Sep 04 '22 12:09 btut

@btut Thank you very much, glad to see this project is moving forward. Yes, as you mentioned, I absolutely require the floating mode functionality. So, for my daily work I stick to JabRef 3.8.2 and I only run the newer JabRef versions to help with bug testing.

I would like JabRef to have the same floating functionality that vesion 3.8.2 had. That entails the following:

If I do not use the search and just select a group from the groups panel, then the following should happen: (a) All articles remain visible in the main table view (unlike what happens now, where all articles that are not part of the group become invisible) (b) The articles that are part of the group all move to the top of the main table view (c) All articles that do not belong to the group receive a gray colour in the main table view

Here an example (I have selected the group "Gyrolepis"): JabRef_floatinggroups_gray

If I do use the search and just select a group from the groups panel, then the following should happen: (a) Only the articles remain visible in the main table view, that contain the searched term - all other articles become invisible. (b) The articles that are part of the group (and contain the search term) all move to the top of the main table view (c) All articles that do not belong to the group but contain the search term receive a gray colour in the main table view

Here an example (I have selected the group "Gyrolepis" and searched for "bristol"): JabRef_floatinggroups_search_gray

This is what old JabRed did. You present an interesting alternative, where articles, that do not belong to a group and do not contain the search term receive a different greyish colour from articles that belong to the group but do not contain the search term (that leaves the question: What about articles that contain the search term, but do not belong to the group?). That could maybe even improve the functionality over the old JabRef (though, as you say, it remains necessary to determine, how usable that is), but I would already be happy if the old functionality could be restored. I think, if you would want to try this improved version, in any case it would be necessary to stack differently coloured articles in the main table view, e.g.: First all the articles that belong to the group and that contain the search term. Second all the articles that belong to the group but do not contain the search term. Third all the articles that do NOT belong to the group but contain the search term. Fourth all the articles that do NOT belong to the group and do NOT contain the search term.

But as I said, I would already be happy with the functionality of old JabRef, which has also already proven it usability. If you want to see how this looks like in action, check the video linked in this post, which showcases the old JabRef functionality: https://github.com/JabRef/jabref/issues/4237#issuecomment-993325334

Thank you for your help!

Sep 04 '22 13:09 AEgit

Thanks a lot for your feedback!

If I do not use the search and just select a group from the groups panel, then the following should happen: (a) All articles remain visible in the main table view (unlike what happens now, where all articles that are not part of the group become invisible) (b) The articles that are part of the group all move to the top of the main table view (c) All articles that do not belong to the group receive a gray colour in the main table view

Here an example (I have selected the group "Gyrolepis"):

This is possible with this PR with one difference - point (b): There is no re-ordering of the table when selecting a group. Matches do not move to the top. This sounds like an easy thing to add, but would be a little difficult to accomplish actually.

If I do use the search and just select a group from the groups panel, then the following should happen: (a) Only the articles remain visible in the main table view, that contain the searched term - all other articles become invisible. (b) The articles that are part of the group (and contain the search term) all move to the top of the main table view (c) All articles that do not belong to the group but contain the search term receive a gray colour in the main table view

Here an example (I have selected the group "Gyrolepis" and searched for "bristol"):

This would also be possible. Both group and search filtering are separate settings. If you apply the search-filter but not the groups-filter you would have this behavior.

This is what old JabRed did. You present an interesting alternative, where articles, that do not belong to a group and do not contain the search term receive a different greyish colour from articles that belong to the group but do not contain the search term (that leaves the question: What about articles that contain the search term, but do not belong to the group?).

Those have gray background but are not opaque. Unfortunately this is hard to distinguish from the others. Thats my main reason for disliking the UI as I implemented it now.

That could maybe even improve the functionality over the old JabRef (though, as you say, it remains necessary to determine, how usable that is), but I would already be happy if the old functionality could be restored. I think, if you would want to try this improved version, in any case it would be necessary to stack differently coloured articles in the main table view, e.g.: First all the articles that belong to the group and that contain the search term. Second all the articles that belong to the group but do not contain the search term. Third all the articles that do NOT belong to the group but contain the search term. Fourth all the articles that do NOT belong to the group and do NOT contain the search term.

Interesting approach, but as I said the reordering is not easy to do. I think something like introducing an invisible column, giving it values like 1=matches group and search, 2=matches group but not search, 3=matches search but not group, 4=matches nothing and sorting by that. (This is basically done for search results now, as with lucene we can score results so we show the best results on top, with the exception that the column is not hidden). But in this case I would also like to introduce a divider with a description what is what, because it would be very hard to understand without background knowledge (same applies to the behavior as of this PR - one would need to know what the opacity change / gray-out means).

But as I said, I would already be happy with the functionality of old JabRef, which has also already proven it usability. If you want to see how this looks like in action, check the video linked in this post, which showcases the old JabRef functionality: #4237 (comment)

Thank you for your help!

The video looks very much like what this PR allows by filtering the search but not groups. When only one type of floating-mode (non-filter-mode) is applied that also alleviates my concern for UI as it is clear what is happening. Problems show when both are enabled and the UI gets ugly.

For now I will focus on refining the functionality and think some more about the highlighting-modes for non-matches. Maybe we can do the reordering for group-matches later on. I personally would prefer the non-reordering though, as I can click through the groups and see the change in highlighting easier as when the list is reordered all the time.

Sep 04 '22 15:09 btut

Cheers - once a build is available for testing with the new functionality implemented, let me know and I can give it a try and report potential issues/bugs. I can then also test how it fares without reordering - I suspect, ultimately reordering will feel better (especially if you are dealing with groups that contain many articles), but for now let's see, how it works.

Sep 04 '22 16:09 AEgit

Note: The merge with the main branch breaks the search functionality. Investigation is needed.

Sep 04 '22 20:09 btut

The test classes don't compile /home/runner/work/jabref/jabref/src/test/java/org/jabref/logic/pdf/search/retrieval/LuceneSearcherTest.java:94: error: method search in class LuceneSearcher cannot be applied to given types;

/home/runner/work/jabref/jabref/src/test/java/org/jabref/model/search/rules/GrammarBasedSearchRuleTest.java:31: error: cannot find symbol GrammarBasedSearchRule searchRule = new GrammarBasedSearchRule(EnumSet.of(SearchRules.SearchFlags.CASE_SENSITIVE, SearchRules.SearchFlags.REGULAR_EXPRESSION)); ^

Sep 10 '22 14:09 Siedlerchr

The test classes don't compile /home/runner/work/jabref/jabref/src/test/java/org/jabref/logic/pdf/search/retrieval/LuceneSearcherTest.java:94: error: method search in class LuceneSearcher cannot be applied to given types;

/home/runner/work/jabref/jabref/src/test/java/org/jabref/model/search/rules/GrammarBasedSearchRuleTest.java:31: error: cannot find symbol GrammarBasedSearchRule searchRule = new GrammarBasedSearchRule(EnumSet.of(SearchRules.SearchFlags.CASE_SENSITIVE, SearchRules.SearchFlags.REGULAR_EXPRESSION)); ^

I know, I did not clean the tests up yet. The tests fail because the CASE_SENSITIVE flag does not exist any more, but actually I will probably bring it back because it will be needed for search-group migrations.

Sep 11 '22 17:09 btut

Sooner or later we should make a clean cut with our migrations. Its a total mess. All the checks if a migration has to be applied happens on JabRef startup or library loading. They are not reliable nor safe. Imagine there are multiple implementations of a subsystem in v4 in v5 and v6 with similar syntax in the data.

Please try to keep the current implementation free of the deprecated search flag and introduce a separate parser for the data source to be migrated in the migrations. We need to separate the concerns of business logic and migrations.

Sep 17 '22 11:09 calixtus

Please try to keep the current implementation free of the deprecated search flag and introduce a separate parser for the data source to be migrated in the migrations. We need to separate the concerns of business logic and migrations.

This would certainly make it easier to work with the code. If we go that way I would argue to not only remove the search flag, but also all the old search logic (SearchRules, etc). Maybe we can place them in a separate package so they can still be used for the migration?

To me it is still not clear if the effort of migrating search-groups is worth it at all. Mostly because:

Simple searches (just a search string) would still work,
Slightly advanced searches (e.g. author="Me" and keyword="Myself") can be migrated easily by string search-and-replace (equal sign becomes colon for lucene)
Expert searches may not produce the exact same results even after migration (case-sensitive search is not possible, fuzzy results may be added)

Maybe it would be best to just try a few things (like the equal-sign to colon replacement) and then ask the user to update the rules if they are not a valid lucene query?

Sep 17 '22 12:09 btut

I still have the opinion we shouldn't bother with the migration at all. Yes, in an ideal world all groups are perfectly migrated. But in reality we have only limited dev power and as @btut says the migration will likely not succeed in all cases. So just throw the old stuff away and maybe show an error if the search expression in a group is not a valid lucene query. What we should have in either case is a nice migration guide!

(I would also not start to replace things like equal signs with colons; this may just introduce errors or half-valid queries)

Sep 26 '22 20:09 tobiasdiez

I still have the opinion we shouldn't bother with the migration at all.

That is basically what was decided in yesterdays devcall. We can get rid of all the old code and I can clean up the lucene code (right now, it feels tacked-on at some points because of the old code).

For search-groups, we decided to just try and parse them using the lucene query-parser. If that does not succeed we highlight the group in an alarming color and display a warning in the tooltip that the syntax changed. This would also be a nice feature going forwards, not just for migrations.

I would also not start to replace things like equal signs with colons; this may just introduce errors or half-valid queries

I see your point. What about always trying lucene first, and if that fails check:

are there eqal-signs preceeded by a field-name AND
are there no colons and only substitute the equals-sign by a colon then?

Also we could only allow this for search-groups and do this conversion when loading the group (at startup). If the conversion yields a valid lucene-query, we could just show a short message explaining that JabRef converted the query automatically and to check it to be sure. If no lucene-query is found, fall-back to the behavior I explained above (alarming highlight and tooltip).

What do you think @tobiasdiez? The more I write about it the more I agree with you to not even bother with the = to : conversion.

Sep 27 '22 08:09 btut

I like the idea with the "alarming color + tooltip". In my opinion, that is sufficient. I guess (although we don't have data on this), free search groups are used relatively rarely and if they are used than mostly likely with a relative complex query (otherwise you use e.g. the keyword group). The migration also has the disadvantage that its always run and thus should be quite conservative to not destroy valid queries again (e.g. something like keyword: "rain=>flood", and people are very creative when it comes to using JabRef features). But maybe as a compromise to an automatic migration, one could have a "migrate" button in the group dialog that is shown if the query is not valid, and that then does a quick&dirty migration.

Sep 27 '22 10:09 tobiasdiez

https://mail.openjdk.org/pipermail/panama-dev/2022-September/017817.html

Sep 30 '22 06:09 calixtus

Whats the status here? Can this be moved to a mergeable state and the migration stuff be done in another PR?

Oct 07 '22 23:10 calixtus

Hi, Sorry, I am quite busy right now. I would rather clean up the code and do the highlighting for invalid searches before merging. Otherwise people could be confused.

Oct 08 '22 11:10 btut

Ok! Sounds great 👍

Oct 08 '22 13:10 calixtus

@Siedlerchr your commit is faulty. return query... is not a java sentence. Even if it's tempting to do small fixes by yourself, in case of @btut we really should just comment as we can be sure he understands our comments and will react soon.

Nov 07 '22 06:11 calixtus

I did not do a commit, it was just meant to be a comment (btut apparently integrated it then)

Nov 07 '22 07:11 Siedlerchr

Sorry, should have caught that. But no harm done, it's just a dev branch at the moment anyways.

Nov 07 '22 07:11 btut

Sorry, my mistake.

Nov 07 '22 11:11 calixtus

Took the liberty to update this branch to newest main. I will also try to work on your todo list about removing the old search stuff if you don't mind.

Dec 27 '22 18:12 calixtus

Took the liberty to update this branch to newest main. I will also try to work on your todo list about removing the old search stuff if you don't mind.

I would love to see this working properly, but cannot seem to find time for it. So yes please, go ahead and make it happen!

Dec 28 '22 12:12 btut

This is the next big thing, the killer feature for our next major release. Everybody wants to see this working.

Dec 29 '22 00:12 calixtus

We need to check performance of large data bases. To prevent issue reports such as https://github.com/JabRef/jabref/issues/9491 :)

Jan 02 '23 19:01 koppor

Next important step:

Describe testing procedure for the search. It was reported that at the latest commit, that search does not work properly any more.

Ideas:

Plain .bib file
.bib file with attached PDFs
Use test PDFs from https://github.com/JabRef/jabref/tree/main/src/test/resources/pdfs (and maybe add more)
Use .bib files from https://github.com/JabRef/jabref/tree/main/src/test/resources/testbib

Challange: systematic test. That means: Minimal example(s).

Apr 10 '23 19:04 koppor

Should compile now, still heavy issues with the performance. Will have to investigate with jdk 21 and virtual threads. But you can play around with this, test it and report issues here.

Sep 12 '23 22:09 calixtus

Your code currently does not meet JabRef's code guidelines. The tool reviewdog already placed comments on GitHub to indicate the places. See the tab "Files" in you PR. Please carefully follow the setup guide for the codestyle. Afterwards, please run checkstyle locally and fix the issues.

More information on code quality in JabRef is available at https://devdocs.jabref.org/getting-into-the-code/development-strategy.html.

Sep 12 '23 22:09 github-actions[bot]

jabref jabref copied to clipboard

Lucene search backend

Whats working

Todo

Follow-ups

Problems:

jabref
jabref copied to clipboard