haystack
Sparse retriever lemmatizer
Is your feature request related to a problem? Please describe.
More semantic-like search with the sparse retriever, and better performance.
Describe the solution you'd like
The spaCy lemmatizer is available for multiple languages and returns good results most of the time, so each document could also be stored in its base form. German example:
Text: Ich gehe jeden zweiten Tag Fussball spielen (English: "I go to play football every second day")
Base: Ich gehen jeden zwei Tag Fussball spielen
With a query like:
Original: wann gehst du Fussball spielen ? (English: "when do you go to play football?")
Base: wann gehen Ich Fussball spielen
The lemmatized version would get a higher score. At the same time, I'd like to reopen the idea of a query expander.
What do you think ?
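The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lemma table `TOY_LEMMAS` is hand-written for the example (a real setup would take lemmas from a lemmatizer such as spaCy, e.g. via `token.lemma_`), and plain term overlap stands in for the sparse retriever's actual scoring, which it is not.

```python
from typing import Dict, List

# Toy lemma table, hand-written for this example only (an assumption).
# A real pipeline would obtain lemmas from a lemmatizer such as spaCy.
TOY_LEMMAS: Dict[str, str] = {
    "gehe": "gehen",
    "gehst": "gehen",
    "zweiten": "zwei",
}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop punctuation-only tokens."""
    return [t for t in text.lower().split() if any(c.isalnum() for c in t)]


def lemmatize(tokens: List[str]) -> List[str]:
    """Map each token to its base form if the toy table knows it."""
    return [TOY_LEMMAS.get(t, t) for t in tokens]


def overlap(query_terms: List[str], doc_terms: List[str]) -> int:
    """Count distinct shared terms; a crude stand-in for sparse scoring."""
    return len(set(query_terms) & set(doc_terms))


doc = "Ich gehe jeden zweiten Tag Fussball spielen"
query = "wann gehst du Fussball spielen ?"

raw_score = overlap(tokenize(query), tokenize(doc))
lemma_score = overlap(lemmatize(tokenize(query)), lemmatize(tokenize(doc)))

print(raw_score, lemma_score)  # prints "2 3"
```

Without lemmatization only "fussball" and "spielen" match; with it, "gehst" and "gehe" both become "gehen" and add a third shared term.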
Describe alternatives you've considered
I don't know a good alternative for it.
Hey @flozi00! This seems to be an interesting feature. An alternative to this could be to make use of Elasticsearch's stemming. However, it seems that Elasticsearch's stemming does not always produce the same stem for words with the same root (see for example here).
@tholor What do you think? Would this be something that we see as part of haystack?
Interesting idea, but I agree with @bogdankostic that leveraging Elasticsearch's existing components (e.g. stemmer, synonyms, analyzer, ...) will probably be more scalable and meaningful. This has the advantage that everything happens on the index side and we don't need to duplicate the documents in an index (the "original" and the "lemmatized" one).
However, I see quite some potential to improve the handling of these Elasticsearch options in Haystack. There could be options to automatically generate lists of synonyms (see also https://github.com/deepset-ai/haystack/issues/841), configure the stemmer, create lists of questions that can be answered from a doc, generate a list of "keywords" for a doc, and so on.
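To make the index-side approach concrete, here is a rough sketch of an Elasticsearch index body with a custom analyzer that lowercases, applies synonyms, and stems German text. The `stemmer` and `synonym` token filter types and the `light_german` stemmer language are standard Elasticsearch options; the names `german_stemmer`, `illustrative_synonyms`, `german_text`, the `content` field, and the synonym entry itself are made up for this example, and the typeless `mappings` layout assumes Elasticsearch 7 or later. How such a body would be wired into Haystack is left open here.

```python
import json

# Hypothetical names: "german_stemmer", "illustrative_synonyms",
# "german_text", and the "content" field are assumptions for this sketch.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Built-in stemmer token filter, configured for German.
                "german_stemmer": {"type": "stemmer", "language": "light_german"},
                # Built-in synonym token filter; the entry is illustrative only.
                "illustrative_synonyms": {
                    "type": "synonym",
                    "synonyms": ["fussball, fußball"],
                },
            },
            "analyzer": {
                "german_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "illustrative_synonyms",
                        "german_stemmer",
                    ],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Analysis runs at index and query time on this one field,
            # so no second "lemmatized" copy of each document is stored.
            "content": {"type": "text", "analyzer": "german_text"}
        }
    },
}

print(json.dumps(index_body, indent=2, ensure_ascii=False))
```

Because the analyzer is attached to the field, the same normalization is applied to documents and queries, which is the scalability advantage mentioned above.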
For whatever it is worth, I also think it would be very useful to be able to incorporate spaCy into Haystack pipelines, particularly for lemmatization. It is my understanding that stemming/lemmatization is undesirable for full semantic/transformer capabilities, but in the event that someone wants to do just keyword searching, lemmatization is vastly superior to stemming.
Also, spaCy seems to have a very similar ethos/focus to Haystack: consolidating state-of-the-art techniques and tools into one package that is accessible to practitioners. Beyond top-notch NLP capabilities, they offer immense multilingual support and, since v3.0, also have an entire transformer mechanism that integrates with Hugging Face models. So there really must be a lot of overlap/synergy with Haystack, and surely it could be added in some meaningful way to your stack!
Also, it is written in Cython, which Haystack doesn't appear to use (but I could be wrong), and that makes it immensely more performant.