feat: `LanguageClassifier`
Related Issues
- Discussed in https://github.com/deepset-ai/haystack/discussions/2972
Proposed Changes:
- Add a `LanguageClassifier` node that can be used to classify documents by language.
- Introduces a dependency on `langdetect` and therefore its dedicated dependency group.
How did you test it?
TODO
Notes for the reviewer
N/A
Checklist
- [x] I have read the contributors guidelines and the code of conduct
- [x] I have updated the related issue with new insights and changes
- [ ] I added tests that demonstrate the correct behavior of the change
- [x] I've used the conventional commit convention for my PR title
- [ ] I documented my code
- [x] I ran pre-commit hooks and fixed any issue
Hi Sara, I was trying to "allow edits" and I followed the instructions, but I am unable to do it. Should I be able to if you made the Pull Request? (You are the creator of the PR, aren't you?)
Hello @stazam yes I was able to open a Pull Request (this page), but I am unable to push commits to your fork, so I can't add the corrections I need to make. The corrections need to be added to your fork in order to show up here.
You can give me access to your fork in other ways if that one didn't work :blush: Try this: https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-access-to-your-personal-repositories/inviting-collaborators-to-a-personal-repository
Ok, I hope you got the invitation :)
Yes I can push now! Thank you :blush:
Note for the assignee/reviewer: verify if this use case can be covered by DocumentClassifier with a language classification model before proceeding.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Stanislav Zamecnik seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you signed the CLA already, but the status is still pending? Let us recheck it.
I already signed it :)
Ok let me see what I can do for the CLA: the bot seems a bit confused :smile:
Any action necessary from my side? :)
> Note for the assignee/reviewer: verify if this use case can be covered by DocumentClassifier with a language classification model before proceeding.
@ZanSara I'll add my two cents, because I have some experience in language identification:
- There are some transformer models which perform this task, such as `papluca/xlm-roberta-base-language-detection`. From a quick search online, though, language identification seems rarely pursued using transformers.
- Some time ago, I found a recent and interesting blog post on a comparison of language identification models (not transformers). It compares several solutions, including Langdetect and fastText. I think that, if we discard the idea of using transformers for this step, there may also be solutions with better accuracy and more supported languages than Langdetect.
@anakin87 So what about giving users the option to choose their own model (with some default set up), like in many other nodes?
I don't mean to put any pressure on anyone, but is there a particular reason that this PR stalled when it seems so close to being complete? I would love to have such a feature so that I could receive queries in any language and route them to the appropriate translation model/pipeline.
@anakin87 I, also, have the impression that transformers tend not to pre-detect languages, and fasttext does seem to be the ideal solution here. But, to @stazam's point, I think it would be ideal if we could choose our own detection method (be it a particular python package or a transformer-based method). Fasttext could be the default for the parameter.
Is there anything I could do to help move this PR along?
Edit: I've done a bit of work and gotten fasttext to work as the detection package. See the below comments for details.
Further work could be done later to make it customizable, but this should be more than fine for now.
p.s. When I said I got fasttext to work in my previous comment, I meant that in only the most literal sense - that if I call _detect_language() directly, it worked.
I just started trying to implement this DocumentLanguageClassifier node in a pipeline and found it to be far too confusing - it wants to work with documents, which doesn't really work how I need it to.
I just want to determine the language of the initial query, and then feed that to the appropriate TransformersTranslator model, and process it with a pipeline.
For now, I'm just using fasttext's detect() directly in my application's endpoint and routing it from there. But it would be nice if this node could handle queries as well.
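The query-routing setup described above can be sketched in a few lines. The detector is stubbed here (a real setup would call fasttext's or langdetect's detection on the query), and the per-language pipeline registry is hypothetical:

```python
from typing import Callable, Dict


def route_query(query: str, detect: Callable[[str], str],
                pipelines: Dict[str, object], default_lang: str = "en"):
    """Send the query to the pipeline registered for its detected language,
    falling back to a default when the language has no dedicated pipeline."""
    lang = detect(query)
    pipeline = pipelines.get(lang, pipelines[default_lang])
    return pipeline, lang
```

This keeps the detection backend swappable: any callable mapping text to a language code works.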
@nickchomey The node is supposed to work with Document objects; that is why the name "DocumentClassifier" 😃 Actually, I am using the node with the `langdetect` library, and it works perfectly in my pipeline.
Sorry everyone, I've totally lost track of this PR and I've been on leave for a few days already. I'm back now :smile:
So let's summarize what's up.
- This node is designed to work with documents, according to @stazam (and the original thread). @nickchomey did you use this node for queries? Does it work? Can we easily generalize it to make it clearer what it can be used for?
- I fully agree that being able to choose the language detection library is great, but I wouldn't make a single node support all of them. We can have subclasses of an abstract node if we want to support different flavors. Something closer to `LangdetectLanguageClassifier`, `TransformersLanguageClassifier`, `FasttextLanguageClassifier`, all inheriting from `BaseLanguageClassifier`.
- This PR is lacking tests. They should definitely be added. @stazam sorry for not highlighting it earlier, I should have pointed it out. If you don't know where to start, have a look at the tests for other nodes, like these for example, and copy the approach. I can help you along if you have issues understanding the code, just let me know. Read about pytest first if you're unfamiliar with it.
- The CI must be green before merging. Please have a look at the contributing guidelines: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md
I'll review this PR once it's in a better shape. Of course let me know if there are blockers, I'll try to help.
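The subclass layout proposed above could be sketched roughly like this. The class names come from the comment; the detection call is stubbed with a trivial placeholder, since a real subclass would delegate to langdetect, transformers, or fasttext:

```python
from abc import ABC, abstractmethod
from typing import List


class BaseLanguageClassifier(ABC):
    """Shared interface; each concrete subclass plugs in one detection backend."""

    @abstractmethod
    def detect(self, text: str) -> str:
        """Return a language code for the given text."""

    def predict(self, texts: List[str]) -> List[str]:
        # common batch behavior lives in the base class
        return [self.detect(t) for t in texts]


class LangdetectLanguageClassifier(BaseLanguageClassifier):
    def detect(self, text: str) -> str:
        # a real implementation would call langdetect.detect(text);
        # stubbed with a trivial heuristic so the sketch is self-contained
        return "en" if text.isascii() else "unknown"
```

Each backend then only overrides `detect`, while batch logic stays in the base class.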
I was not successful using this node for query language detection. I ended up just installing fasttext and calling it directly. Your suggestion for a base class and subclasses seems good to me. I don't think I'd be particularly useful in creating any of it, though.
@ZanSara how can I help push this PR forward? I think even if it's a small feature, it could be useful for many users.
Some little ideas:
- IMHO, `BaseLanguageClassifier` could be a subclass of `BaseDocumentClassifier`. Only the methods `predict` and `predict_batch` should be implemented. WDYT?
- While I said before that `langdetect` is not the best language detection library, I note that it is already a base dependency of Haystack: https://github.com/deepset-ai/haystack/blob/1399681c818c03326b6742c410aca97dc9c96b60/pyproject.toml#L72 Therefore, while fastText could represent a heavyweight dependency (not being developed natively in Python), we could have `LangdetectLanguageClassifier` and `TransformersLanguageClassifier` at low cost.
Hey @anakin87, thanks for keeping this one up!
- We can totally make `BaseLanguageClassifier` a subclass of `BaseDocumentClassifier`, but on the other hand, I believe there is a good use case for language detection of the query. How do you feel about generalizing this class to be able to perform language detection on queries and documents? We can have two separate nodes if needed, one inheriting from `BaseDocumentClassifier`, another from `BaseQueryClassifier` (mind that `BaseQueryClassifier` might use a revamp; it is very basic and possibly unused right now), or a single node inheriting from both classes.
- Let's start with `TransformersLanguageClassifier` and `LangdetectLanguageClassifier`. From there, adding other options should be easy if there happens to be demand for them.

Make sure to merge main into this branch before resuming work here, as a lot has changed in the meantime and merge conflicts could be very painful later.
Hello @ZanSara!
Your idea seems interesting...
I have two doubts:
- I don't have the rights to push to this PR, so I need to open a new one. Am I wrong?
- A deeper doubt is related to #2999. Given the limitations of the current implementation of Pipelines, does it make sense to design a decision node for language detection if in most cases we need to divide documents into consistent language groups before passing them to an indexing pipeline? Maybe I misunderstood the limits of Pipelines... An alternative solution could be to enrich the document with language metadata. I need to know your point of view on this aspect...
@anakin87 Ad 1. No need to open a new PR, I added you as a collaborator. Ad 2. If we say in the documentation that this node is supposed to come before the PreProcessor, then everything should be fine, I guess.
@anakin87
- Have a try and see if you can make it work on @stazam's fork; otherwise, you can fork from his and progress from there.
- Regarding the branching: yes, that's a good question. I believe we can do both: adding the language to the metadata and routing onto different branches, separately. Something like this:
```python
from typing import List, Optional

class BaseLanguageClassifier(BaseDocumentClassifier):
    def __init__(
        self,
        add_lang_to_meta: bool = True,
        route_by_lang: bool = True,
        langs_to_route: Optional[List[str]] = None,  # None -> default to ["english"] inside __init__
    ):
        ...
```
So you can decide whether to route onto different output edges or not, and which languages to route, independently of whether the detected language is being added to the metadata or not. Should not be too much of an overhead. WDYT?
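A rough sketch of how those parameters could interact at run time, as plain functions rather than the actual node. The detector is stubbed (a real node would call a detection library), and the documents are plain dicts and edge names hypothetical, just to illustrate the metadata/routing split:

```python
from typing import Callable, Dict, List, Optional


def classify_and_route(
    documents: List[dict],
    detect: Callable[[str], str],          # e.g. a langdetect-backed callable
    add_lang_to_meta: bool = True,
    route_by_lang: bool = True,
    langs_to_route: Optional[List[str]] = None,
) -> Dict[str, List[dict]]:
    """Optionally tag each document with its detected language, and/or
    group documents onto one output edge per routed language."""
    langs = langs_to_route or ["english"]
    edges: Dict[str, List[dict]] = {f"output_{i + 1}": [] for i in range(len(langs))}
    for doc in documents:
        lang = detect(doc["content"])
        if add_lang_to_meta:
            doc.setdefault("meta", {})["language"] = lang
        if route_by_lang and lang in langs:
            edges[f"output_{langs.index(lang) + 1}"].append(doc)
    return edges
```

Metadata tagging and routing stay independent switches, which is the point of the two parameters.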
Hey @ZanSara! I started to revamp this old and dear PR :smile:
I would like to keep this node simple and flexible.
The issue of routing docs of different languages makes me think...
- the method `run` should not support documents of multiple languages in the same list
- (obviously, calling `run` several times, each time with a different document, is ok)
- about `run_batch` I have mixed feelings... It would be great to make it route mixed documents of multiple languages, but I see that, for example, the `FileClassifier` doesn't support such a situation. And in any case, it would break the original document lists. So I think that supporting this feature in `run_batch` could overcomplicate things...
What do you think about this point?
(another little question: what is the ideal purpose of `run_batch`? It seems to me that sometimes this method implements different behaviours in different nodes)
@ZanSara since our last discussions, a few things have changed and I've been trying to make some decisions on this node...
Proposed Changes:
New nodes for Document Language Classification: `LangdetectDocumentLanguageClassifier` and `TransformersDocumentLanguageClassifier`.
- they store the language in the Documents' metadata
- they can send the Documents to different edges, depending on their language (if the `route_by_language` parameter is `True`)
- :warning: if routing is enabled, mixed languages aren't allowed in the same list of documents (as in similar decision nodes)
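The mixed-language constraint in the last point can be enforced with a small guard when picking the output edge. A minimal sketch with a stubbed detector; the helper name and error messages are illustrative, not the actual implementation:

```python
from typing import Callable, List


def routing_edge(contents: List[str], detect: Callable[[str], str],
                 languages: List[str]) -> str:
    """Return the output edge for a list of documents, refusing mixed languages."""
    detected = {detect(c) for c in contents}
    if len(detected) > 1:
        raise ValueError(
            f"Mixed languages {sorted(detected)} are not allowed in one list "
            "when routing is enabled"
        )
    lang = detected.pop()
    if lang not in languages:
        raise ValueError(f"Language {lang!r} is not among the routed languages {languages}")
    return f"output_{languages.index(lang) + 1}"
```

Failing loudly on mixed input matches what similar decision nodes (e.g. FileClassifier) already do.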
How did you test it?
Several new tests.
Notes for the reviewer
I tried to make an effort to carry this PR forward… I made several assumptions during development, and now it makes some sense...
In any case, let's try together to put it into shape! :muscle:
@ZanSara I've incorporated your feedback and left just one small comment...
News:
- Since some Transformers models do not return language names but generic labels, the user can now specify a mapping, e.g. `labels_to_languages_mapping = {"LABEL_1": "ar", "LABEL_2": "bg", ...}` (the chosen default model returns correct language codes and does not require the user to specify this mapping)
- new/updated tests for this feature
- smaller Transformer model in tests
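Applying such a mapping amounts to a dictionary lookup with a pass-through fallback. A minimal sketch (the mapping below is illustrative; a model's real labels are listed on its model card):

```python
labels_to_languages_mapping = {"LABEL_1": "ar", "LABEL_2": "bg"}


def label_to_language(label: str, mapping: dict) -> str:
    # models that already emit language codes pass through unchanged
    return mapping.get(label, label)
```

The fallback is what lets the default model skip the mapping entirely.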
Ready for the final review! (I see some problems with the CLA...)
@anakin87, @ZanSara I am actually not sure whether it's a problem that I appear not to have signed the CLA, although I already did so a few months ago.
@stazam can you retry to sign it?
@anakin87 I tried many times, but it is not possible (there is no re-sign button or anything). @ZanSara If necessary, I can send the form to your email (I don't want to block this for too long).
@ZanSara please take a look when you can (still some problems with the CLA...)
Hey sorry for the silence. The CLA bot is really acting up on this PR... I'll reach out to a colleague to see what we can do for it. I'll review in the meantime :slightly_smiling_face: