
Add solr support for synonyms for numbers/abbreviations

Open popcar2 opened this issue 2 years ago • 7 comments

I've been using the website for a long time now, and one of my biggest gripes is how searching works. When searching for books on OpenLibrary, you often need to type the title exactly. This means that if a title spells out numbers as words (One, Two, Three, etc.), searching for the same title with digits (1, 2, 3, etc.) returns no results.

Another example: if a book uses "Vol." in its title, searching for "volume" returns no results even though the two mean the same thing. This makes finding specific books a lot more difficult.

Describe the problem that you'd like solved

The search engine only matches exact terms, but it should tolerate numbers written in different forms and words of equivalent meaning. Here's an example:

I would like searches for both "The Walking Dead Compendium Four" and "The Walking Dead Compendium 4" to find the book.

Proposal & Constraints

The search engine should be tolerant of words that have the same meaning: "Vol." should be treated the same as "Volume"; "Two" should be treated the same as "2" or "II"; and "&" and "and" should be interchangeable.

Additional context

Another example, but with "vol" and "volume".

popcar2 avatar Jun 07 '22 16:06 popcar2

I think the solution for this would be to make use of solr's synonyms feature, but some experimenting / investigation is needed. Anyone who has some time to experiment with adding synonyms to solr, please do!

cdrini avatar Aug 15 '22 17:08 cdrini

@cdrini I would like to do it, how can I?

bicolino34 avatar Aug 16 '22 06:08 bicolino34

The search is strict not only with terms, but also with letters. Compare Безпека життєдіяльності and Безпека життєдіяльност. With just one letter missing (і), there are no results.

bicolino34 avatar Aug 20 '22 16:08 bicolino34

So this is a solr research task; here are some of the places that will need modifications:

The solr schema, which defines the various types of text fields, has synonyms enabled, but only at query time:

https://github.com/internetarchive/openlibrary/blob/82bc2f61c8c41363567d398b7b027a16775dbc91/conf/solr/conf/managed-schema#L426-L467

This blog post has some info: https://library.brown.edu/create/digitaltechnologies/using-synonyms-in-solr/

In a nutshell, we need to add the synonyms to https://github.com/internetarchive/openlibrary/blob/ccabd95be2a82c4f79d94b1f10e46ea1d3c5c730/conf/solr/conf/synonyms.txt

And then test locally with a full reindex (see https://github.com/internetarchive/openlibrary/wiki/Solr#making-changes-to-solr-config ).

But for numbers, they probably need to be in English only for now? I'm not sure how we should handle non-English numbers. Ideally we'd want different synonyms files for different user locales, but I'm not sure if/how to do this in solr.
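
For the English-only case, the number entries could be generated rather than typed by hand. Here is a minimal sketch, assuming the plain comma-separated equivalence format used in solr synonym files. The helper names (`to_roman`, `number_synonym_lines`) and the cutoff at twenty are made up for illustration. Whether Roman numerals are safe to include would need testing, since "i", "v" and "x" are also ordinary tokens.

```python
from __future__ import annotations

# Sketch only: prints candidate lines for a synonyms file.
# The word list stops at twenty purely to keep the example short.
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
]


def to_roman(n: int) -> str:
    """Lowercase Roman numeral for small positive integers (1..39)."""
    pairs = [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def number_synonym_lines(include_roman: bool = True) -> list[str]:
    """Build one comma-separated equivalence line per number."""
    lines = []
    for n, word in enumerate(NUMBER_WORDS, start=1):
        terms = [str(n), word]
        if include_roman:
            terms.append(to_roman(n))
        lines.append(",".join(terms))
    return lines


if __name__ == "__main__":
    print("\n".join(number_synonym_lines()))
```

Passing `include_roman=False` leaves out the Roman numerals, which makes it easy to compare search results with and without them after a reindex.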

cdrini avatar Aug 23 '22 20:08 cdrini

But we can definitely add something like vol,vol.,Volume in there and see if it helps with that!
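
To make concrete what a line like vol,vol.,Volume is expected to do at query time, here is a rough Python model. It is not solr's implementation and the function names are hypothetical. It lowercases everything and splits on whitespace, whereas the real behaviour depends on the field's tokenizer and filters (for instance, whether the period in "vol." survives tokenization).

```python
# Toy model of comma-separated synonym lines; NOT how solr is implemented.
SAMPLE_SYNONYMS = """\
# hypothetical entries for illustration
vol,vol.,Volume
&,and
"""


def parse_synonyms(text):
    """Map each term to the full set of terms on its equivalence line."""
    groups = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and explicit "a => b" mappings (not modelled here).
        if not line or line.startswith("#") or "=>" in line:
            continue
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in terms:
            groups.setdefault(term, set()).update(terms)
    return groups


def expand_query(query, groups):
    """Return, for each whitespace-separated token, its sorted list of equivalents."""
    expanded = []
    for token in query.split():
        key = token.lower()
        expanded.append(sorted(groups.get(key, {key})))
    return expanded


if __name__ == "__main__":
    groups = parse_synonyms(SAMPLE_SYNONYMS)
    for alternatives in expand_query("Saga Vol. 2", groups):
        print(alternatives)
```

In this model a query token that appears on a synonym line matches any of the other terms on that line, while tokens that appear on no line are left alone.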

cdrini avatar Aug 23 '22 20:08 cdrini

Actually it looks like the synonyms file is working! You can see the TV one in action here: https://openlibrary.org/search?q=television+kid&mode=everything .

So adding volume should be easy enough!

cdrini avatar Aug 23 '22 20:08 cdrini

@bicolino34 Your issue would probably be handled by solr's spell checking features, i.e. showing something like "Did you mean?" when a user's query is close but not perfectly correct. Would you mind creating a separate issue to add support for "Did you mean?"? That'll require a different approach on the solr side, but would help users a ton!
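
As an illustration of the idea only (solr's spell checking works against the index and is configured differently), a "Did you mean?" suggestion can be modelled with Python's standard `difflib`. The function name, the toy vocabulary, and the 0.8 cutoff are assumptions made for this sketch.

```python
import difflib


def did_you_mean(query, vocabulary, cutoff=0.8):
    """Suggest a corrected (lowercased) query, or None if nothing changes.

    Each token not found in the vocabulary is replaced by its closest
    vocabulary word, provided the similarity ratio is at least `cutoff`.
    """
    vocab = sorted({word.lower() for word in vocabulary})
    vocab_set = set(vocab)
    corrected = []
    for token in query.lower().split():
        if token in vocab_set:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    suggestion = " ".join(corrected)
    return suggestion if suggestion != query.lower() else None


if __name__ == "__main__":
    # Toy vocabulary; a real one would come from indexed titles.
    vocabulary = ["безпека", "життєдіяльності"]
    print(did_you_mean("Безпека життєдіяльност", vocabulary))
```

With the query from the comment above, the token missing its final letter is close enough to the vocabulary word to be replaced, so the sketch suggests the full title.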

cdrini avatar Aug 30 '22 17:08 cdrini

Is this addressed by #6922? Can this issue be consolidated into that one?

mekarpeles avatar Jul 15 '24 15:07 mekarpeles

#6922 is the PR which is meant to address this issue, but it's stuck in review and has some issues.

tfmorris avatar Jul 15 '24 17:07 tfmorris