jabref
jabref copied to clipboard
Add support for Lucene as search syntax
Fixes https://github.com/JabRef/jabref/issues/1975
As far as I tested, the search syntax of groups can be reused (= signs also work at Lucene).
The search using regular expressions is different in Lucene. One has to use /.../ to indicate regular expressions. Thus, groups can now also use regular expressions.
Left TODOs:
- [ ] The reg ex switch in the search bar should be removed. All reg ex searches should be handled by the lucene-based search
- [ ] The search rule describer for Lucene has to be created. Currently only a dummy.
- [ ] Think of removing the search grammar completely
- [ ] Optional: Implement Glob patterns - first idea: http://svn.apache.org/repos/asf/jakarta/oro/trunk/src/java/org/apache/oro/text/ --> GlobCompiler.java -- or "just" transform the string to a regex --> https://stackoverflow.com/q/1247772/873282
- [x] Change in
CHANGELOG.mddescribed in a way that is understandable for the average user (if applicable) - [x] Tests created for changes (if applicable)
- [ ] Manually tested changed features in running JabRef (always required)
- [ ] Screenshots added in PR description (for UI changes)
- [ ] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.
I think overall it would make sense to rework the search mechanism to use a Lucene index to improve speed.
A few thoughts on that:
- Currently, search is done for each entry individually. So a query is executed on each entry and a boolean return value indicates whether the entry fits the query or not. With Lucene, a query is executed on the index and returns entries. For the Fulltext-search, I had to implement a workaround, storing all returned entries that fit a querry and then, when the query is executed on an entry, it actually checks if the entry is in that list. I dislike that workaround.
- Lucene does not provide a binary decision whether the entry matches the query. Lucene provides a score. This means it cannot (should not) be used as a binary filter over the entry-table (as is done now). Instead, the entry-table should be sorted according to the scoring. To give the user feedback on where the query was found in an entry, it would make sense to leverage Lucene's...
- Highlighting. Lucene can generate html formatted (converted to a JavaFX TextFlow for the fulltext results) highlighted string, similar to what google does when displaying search results.
I sense a damn f**** cool possible new (or renewed) feature for JabRef here. Sort by lucene score and/or highlight/opaque entries by score - like floating results in ancient swing times.
I agree. If done well, this could be a very neat feature and the speed-up would be icing on the cake.
Current blocker - Lucene does not support == and !=.
=inexact match. In Lucene, a glob pattern would be needed. In the current implementation, I did not enforce that, but implemented an inexact match==is an exact match. Parsing error in Lucene!=is NOT. Parsing error in Lucene

How to proceed?
- Keep our search (and maybe map to Lucene for full text support)
- Drop support for
==and!=
I would completely drop our search syntax and only use lucenes one. If users really want an equals instead of contains search, then we can add this later (for example by adding a field alias with no tokenizer).
Note to self: Read Lucene documentation thoroughly --> http://www.lucenetutorial.com/lucene-query-syntax.html
What is a match?
- Currently:
title:Jab--> Matches "JabRef tool" - Lucene:
title:Jab--> does not match "JabRef tool"
- Currently:
title:Jab--> Matches "JabRef tool" - Lucene:
title:Jab*--> Matches "JabRef tool"
- Currently:
title:JabRef--> Matches "JabRef tool" - Lucene:
title:JabRef--> Matches "JabRef tool"
It is OK for me to implement like that. The "only" concern I have is that it does not match the UX one has from WWW search enginges (Google, Bing, ...)
If users really want an equals instead of contains search, then we can add this later (for example by adding a field alias with no tokenizer).
I think the most efficient way would be to do a fuzzy search anyhow and then do the equals search on the results of the fuzzy search as we did before. I think that would be better than storing everything twice (tokenized and untokenized).
... After writing this I did a short google search and it appears like exact matches are possible in lucene using quotes. https://stackoverflow.com/questions/37495639/how-to-match-exact-text-in-lucene-search Isn't that exactly what we want?
Minor comment: I rebased this PR on main, because we merged https://github.com/JabRef/jabref/pull/8636
ADR https://github.com/JabRef/jabref/blob/main/docs/adr/0015-support-an-abstract-query-syntax-for-query-conversion.md should be updated
... After writing this I did a short google search and it appears like exact matches are possible in lucene using quotes.
I would object - based on https://www.lucenetutorial.com/lucene-query-syntax.html


I would follow the suggestion of https://github.com/JabRef/jabref/pull/8206#issuecomment-968372167 and implement the interpretation of the Lucene syntax "more relaxed".
Question: getting rid of current search syntax will break groups that are configured to search based on
searching for a keywordfree search expression
right? Is there a way the syntax could be backwards compatible?
Question: getting rid of current search syntax will break groups that are configured to search based on
searching for a keywordfree search expressionright? Is there a way the syntax could be backwards compatible?
I never used that feature so I had to look it up first. In general, one goal of the lucene-implementation would be to get rid of all the old search infrastructure. This is hard to do when keeping backwards compatibility. But there is hope. 'Simple' searches, just looking for a string in a field, should straight up just work in lucene as well. The problems arise when we look at the case-sensitive and regex switches and to scoring:
- Lucene does not support case-insensitive searches. Searches are always case-sensitive. If you want a case-insensitive search, you lowercase everything that you put into the index and your search strings.
- I do not know if the regex implementation of JabRef and Lucene match. If so, great. If not, we could have a 'migration' feature that parses search strings in the group settings on startup and if they match JabRef syntax they are translated to lucene syntax. We need the regex parser for other things anyways so we wouldn't get rid of it anyways.
- The problem that makes the current fulltext-search less usable than it could be is the scoring. Lucene almost always finds something. It might not be a good fit but still better than other entries. We need a good scoring mechanism to filter out irrelevant results. I don't know if a simple score-threshold would suffice or we need something more sophisticated.
Superseeded by https://github.com/JabRef/jabref/pull/8963
This PR can be used to check for similarities and differences by using the Lucene engine or (our) custom search.