
feat: `LanguageClassifier`

Open ZanSara opened this issue 3 years ago • 11 comments

Related Issues

  • Discussed in https://github.com/deepset-ai/haystack/discussions/2972

Proposed Changes:

  • Add a LanguageClassifier node that can be used to classify documents by language.
  • Introduces a dependency on langdetect and therefore its dedicated dependency group.

How did you test it?

TODO

Notes for the reviewer

N/A

Checklist

ZanSara avatar Aug 08 '22 13:08 ZanSara

Hi Sara, I was trying to "allow edits" and I followed the instructions, but I am unable to do it. Should I be able to, given that you made the pull request? (You are the creator of the PR, aren't you?)

stazam avatar Aug 08 '22 21:08 stazam

Hello @stazam yes I was able to open a Pull Request (this page), but I am unable to push commits to your fork, so I can't add the corrections I need to make. The corrections need to be added to your fork in order to show up here.

You can give me access to your fork in other ways if that one didn't work :blush: Try this: https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-access-to-your-personal-repositories/inviting-collaborators-to-a-personal-repository

ZanSara avatar Aug 09 '22 08:08 ZanSara

Ok, I hope you got the invitation :)

stazam avatar Aug 09 '22 08:08 stazam

Yes I can push now! Thank you :blush:

ZanSara avatar Aug 09 '22 13:08 ZanSara

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Aug 09 '22 13:08 CLAassistant

Note for the assignee/reviewer: verify if this usecase can be covered by DocumentClassifier with a language classification model before proceeding.

ZanSara avatar Aug 10 '22 15:08 ZanSara

CLA assistant check Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Stanislav Zamecnik seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.

I already signed it :)

stazam avatar Aug 11 '22 08:08 stazam

Ok let me see what I can do for the CLA: the bot seems a bit confused :smile:

ZanSara avatar Aug 11 '22 09:08 ZanSara

Any action necessary from my side? :)

stazam avatar Aug 17 '22 08:08 stazam

> Note for the assignee/reviewer: verify if this usecase can be covered by DocumentClassifier with a language classification model before proceeding.

@ZanSara I'll add my two cents, because I have some experience in language identification:

  • There are some transformer models which perform this task, such as papluca/xlm-roberta-base-language-detection. From a quick online search, language identification seems to be rarely pursued with transformers.
  • Some time ago, I found a recent and interesting blog post comparing language identification models (not transformers). It compares several solutions, including langdetect and fastText. I think that, if we discard the idea of using transformers for this step, there may also be solutions with better accuracy and more supported languages than langdetect.

anakin87 avatar Aug 17 '22 11:08 anakin87

@anakin87 So what about giving the user an option to choose their own model (with a sensible default), as in many other nodes?

stazam avatar Aug 17 '22 14:08 stazam

I don't mean to put any pressure on anyone, but is there a particular reason that this PR stalled when it seems so close to being complete? I would love to have such a feature so that I could receive queries in any language and route them to the appropriate translation model/pipeline.

@anakin87 I also have the impression that transformers tend not to be used for language detection, and fastText does seem to be the ideal solution here. But, to @stazam's point, I think it would be ideal if we could choose our own detection method (be it a particular Python package or a transformer-based method). fastText could be the default for the parameter.

Is there anything I could do to help move this PR along?

Edit: I've done a bit of work and gotten fasttext to work as the detection package. See the below comments for details.

Further work could be done later to make it customizable, but this should be more than fine for now.

nickchomey avatar Sep 25 '22 19:09 nickchomey

P.S. When I said I got fastText to work in my previous comment, I meant that in only the most literal sense: if I call _detect_language() directly, it works.

I just started trying to implement this DocumentLanguageClassifier node in a pipeline and found it to be far too confusing - it wants to work with documents, which doesn't really work how I need it to.

I just want to determine the language of the initial query, and then feed that to the appropriate TransformersTranslator model, and process it with a pipeline.

For now, I'm just using fasttext detect() directly in my application's endpoint and routing it from there. But it would be nice if this node could handle queries as well
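This query-side routing can be illustrated with a minimal sketch in plain Python. Note that `route_query`, the stub detector, and the translator names here are all hypothetical placeholders, standing in for a real detector (such as fastText's `detect()`) and actual translation pipelines; none of this is Haystack API:

```python
from typing import Callable, Dict

def route_query(
    query: str,
    detect_language: Callable[[str], str],
    translators: Dict[str, str],
    default: str = "en",
) -> str:
    """Pick a translator pipeline key based on the detected query language."""
    lang = detect_language(query)
    # Unknown languages fall back to the default pipeline
    return translators.get(lang, translators[default])

# Stub detector standing in for a real library call such as fastText's detect()
stub_detect = lambda q: "fr" if "bonjour" in q.lower() else "en"
translators = {"en": "identity_pipeline", "fr": "fr_to_en_translator"}
print(route_query("Bonjour tout le monde", stub_detect, translators))
# fr_to_en_translator
```

Swapping in a real detector is then just a matter of passing a different callable.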

nickchomey avatar Sep 27 '22 23:09 nickchomey

@nickchomey The node is supposed to work with Document objects; that is why it is named "DocumentClassifier" 😃 Actually, I am using the node with the langdetect library, and it works perfectly in my pipeline.

stazam avatar Sep 29 '22 07:09 stazam

Sorry everyone, I've totally lost track of this PR and I've been on leave for a few days already. I'm back now :smile:

So let's summarize what's up.

  • This node is designed to work with documents, according to @stazam (and the original thread). @nickchomey did you use this node for queries? Does it work? Can we easily generalize it to make it clearer what it can be used for?

  • I fully agree that being able to choose the language detection library is great, but I wouldn't make a single node support all of them. We can have subclasses of an abstract node if we want to support different flavors. Something closer to LangdetectLanguageClassifier, TransformersLanguageClassifier, FasttextLanguageClassifier, all inheriting from BaseLanguageClassifier

  • This PR is lacking tests. They should definitely be added. @stazam sorry for not highlighting it earlier, I should have pointed it out. If you don't know where to start, have a look at the tests for other nodes, like these for example, and copy the approach. I can help you along if you have issues understanding the code, just let me know. Read about pytest first if you're unfamiliar with it.

  • The CI must be green before merging. Please have a look at the contributing guidelines: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md

I'll review this PR once it's in a better shape. Of course let me know if there are blockers, I'll try to help.

ZanSara avatar Oct 11 '22 17:10 ZanSara

I was not successful using this node for query language detection. I ended up just installing fastText and calling it directly. Your suggestion for a base class and subclasses seems good to me. I don't think I'd be particularly useful in creating any of it, though.

nickchomey avatar Oct 11 '22 21:10 nickchomey

@ZanSara how can I help push this PR forward? I think even if it's a small feature, it could be useful for many users.

Some little ideas

  • IMHO, BaseLanguageClassifier could be a subclass of BaseDocumentClassifier. Only the methods predict and predict_batch should be implemented. WDYT?
  • While I said before that langdetect is not the best language detection library, I note that it already is a base dependency of Haystack: https://github.com/deepset-ai/haystack/blob/1399681c818c03326b6742c410aca97dc9c96b60/pyproject.toml#L72 Therefore, while fastText could represent a heavyweight dependency (not being developed natively in Python), we could have LangdetectLanguageClassifier and TransformersLanguageClassifier at low cost.

anakin87 avatar Nov 17 '22 17:11 anakin87

Hey @anakin87, thanks for keeping this one up!

  • We can totally make BaseLanguageClassifier a subclass of BaseDocumentClassifier, but on the other hand, I believe there is a good usecase for language detection of the query. How do you feel about generalizing this class to perform language detection on both queries and documents? We can have two separate nodes if needed, one inheriting from BaseDocumentClassifier and another from BaseQueryClassifier (mind that BaseQueryClassifier might need a revamp; it is very basic and possibly unused right now), or a single node inheriting from both classes.

  • Let's start with TransformersLanguageClassifier and LangdetectLanguageClassifier. From there, adding other options should be easy if there happens to be demand for it.

Make sure to merge main into this branch before resuming work here, as a lot has changed in the meantime and merge conflicts could be very painful later.

ZanSara avatar Dec 05 '22 10:12 ZanSara

Hello @ZanSara!

Your idea seems interesting...

I have two doubts:

  • I don't have the rights to push to this PR, so I need to open a new one. Am I wrong?
  • A deeper doubt is related to #2999. Given the limitations of the current implementation of Pipelines, does it make sense to design a decision node for language detection if, in most cases, we need to divide documents into consistent language groups before passing them to an indexing pipeline? Maybe I misunderstood the limits of Pipelines... An alternative solution could be to enrich the documents with language metadata. I'd like to know your point of view on this aspect...
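The metadata-enrichment alternative could look roughly like this. This is a sketch using plain dicts instead of Haystack Document objects, with a stub detector; none of these names are actual Haystack API:

```python
from typing import Callable, Dict, List

def add_language_to_meta(
    documents: List[Dict],
    detect_language: Callable[[str], str],
) -> List[Dict]:
    """Attach the detected language to each document's metadata in place."""
    for doc in documents:
        doc.setdefault("meta", {})["language"] = detect_language(doc["content"])
    return documents

# Stub detector in place of a real language-identification library
stub_detect = lambda text: "de" if "Hallo" in text else "en"
docs = [{"content": "Hello world"}, {"content": "Hallo Welt"}]
enriched = add_language_to_meta(docs, stub_detect)
# enriched[1]["meta"]["language"] == "de"
```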

anakin87 avatar Dec 05 '22 11:12 anakin87

@anakin87 Ad 1. No need to open a new PR, I added you as a collaborator. Ad 2. If we say in the documentation that this node is supposed to come before the PreProcessor, then everything should be fine, I guess.

stazam avatar Dec 05 '22 12:12 stazam

@anakin87

  • Try and see if you can make it work on @stazam's fork; otherwise, you can fork from his and progress from there.

  • Regarding the branching: yes, that's a good question. I believe we can do both: add the language to the metadata and route onto different branches, separately. Something like this:

from typing import List, Optional

class BaseLanguageClassifier(BaseDocumentClassifier):

    def __init__(
            self,
            add_lang_to_meta: bool = True,
            route_by_lang: bool = True,
            langs_to_route: Optional[List[str]] = None,
        ):
        # Avoid a mutable default argument; fall back to ["english"]
        self.langs_to_route = langs_to_route or ["english"]
        ...

So you can decide whether to route onto different output edges or not, and which languages to route, independently of whether the detected language is added to the metadata. It should not be too much of an overhead. WDYT?
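The routing half of this idea, i.e. grouping documents onto per-language output edges, could be illustrated as follows (plain dicts and a stub detector; not actual Haystack code):

```python
from collections import defaultdict
from typing import Callable, Dict, List

def split_by_language(
    documents: List[Dict],
    detect_language: Callable[[str], str],
    langs_to_route: List[str],
) -> Dict[str, List[Dict]]:
    """Group documents by detected language, keeping only routed languages."""
    edges = defaultdict(list)
    for doc in documents:
        lang = detect_language(doc["content"])
        if lang in langs_to_route:
            edges[lang].append(doc)
    return dict(edges)

# Stub detector in place of a real language-identification library
stub_detect = lambda text: "italian" if "ciao" in text.lower() else "english"
docs = [{"content": "Ciao mondo"}, {"content": "Good morning"}]
edges = split_by_language(docs, stub_detect, langs_to_route=["english", "italian"])
# edges holds one list of documents per detected language
```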

ZanSara avatar Dec 07 '22 09:12 ZanSara

Hey @ZanSara! I started to revamp this old and dear PR :smile:

I would like to keep this node simple and flexible.

The issue of routing docs of different languages makes me think...

  • the method run should not support documents of multiple languages in the same list

  • (obviously, calling run several times, each one with a different document, is ok)

  • about run_batch I have mixed feelings... It would be great to make it route mixed documents of multiple languages, but I see that, for example, the FileClassifier doesn't support such a situation. And in any case, it would break the original document lists. So I think that supporting this feature in run_batch could overcomplicate things...

What do you think about this point?

(another little question: what is the ideal purpose of run_batch? It seems to me that sometimes this method implements different behaviours in different nodes)

anakin87 avatar Feb 15 '23 21:02 anakin87

@ZanSara since our last discussions, a few things have changed and I've been trying to make some decisions on this node...

Proposed Changes:

new nodes for Document Language Classification: LangdetectDocumentLanguageClassifier and TransformersDocumentLanguageClassifier

  • they store the language in the Documents metadata
  • they can send the Documents to different edges, depending on their language (if route_by_language parameter is True)
  • :warning: if routing is enabled, mixed languages aren't allowed in the same list of documents (as in similar decision nodes)
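The mixed-language restriction could be enforced with a simple guard along these lines (an illustrative sketch, not the actual implementation in the PR):

```python
from typing import List

def check_single_language(detected_languages: List[str]) -> str:
    """Ensure all documents in a batch share one language before routing."""
    unique = set(detected_languages)
    if len(unique) > 1:
        raise ValueError(
            f"Documents of multiple languages {sorted(unique)} cannot be routed together."
        )
    return unique.pop()

print(check_single_language(["en", "en", "en"]))
# en
```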

How did you test it?

Several new tests.

Notes for the reviewer

I made an effort to carry on this PR… So I made several assumptions during development, and now it makes some sense...

In any case, let's try together to put it into shape! :muscle:

anakin87 avatar Feb 21 '23 23:02 anakin87

@ZanSara I've incorporated your feedback and left just one small comment...

anakin87 avatar Feb 22 '23 20:02 anakin87

News:

  • Since some Transformers models do not return language names but generic labels, the user can now specify a mapping, e.g. labels_to_languages_mapping = {"LABEL_1": "ar", "LABEL_2": "bg", ...} (the chosen default model returns correct language codes and does not require the user to specify this mapping)
  • new/updated tests for this feature
  • smaller Transformer model in tests
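Applying such a mapping amounts to a dictionary lookup with a pass-through default; a hypothetical sketch (the function name is illustrative, not the node's actual API):

```python
from typing import Dict, Optional

def map_label_to_language(
    label: str,
    labels_to_languages_mapping: Optional[Dict[str, str]] = None,
) -> str:
    """Translate a model's generic label into a language code, if a mapping is given."""
    if labels_to_languages_mapping is None:
        # The default model already returns language codes, so pass through
        return label
    return labels_to_languages_mapping.get(label, label)

print(map_label_to_language("LABEL_1", {"LABEL_1": "ar", "LABEL_2": "bg"}))
# ar
```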

Ready for the final review! (I see some problems with the CLA...)

anakin87 avatar Feb 23 '23 17:02 anakin87

@anakin87, @ZanSara I am actually not sure whether it is a problem that I did not sign the CLA here, although I already signed it a few months ago.

stazam avatar Feb 23 '23 19:02 stazam

@stazam can you try signing it again?

anakin87 avatar Feb 23 '23 19:02 anakin87

@anakin87 I tried many times, but it is not possible (there is no re-sign button or anything). @ZanSara If necessary, I can send the form to your email (I don't want to block this for too long).

stazam avatar Feb 23 '23 19:02 stazam

@ZanSara please take a look when you can (still some problems with the CLA...)

anakin87 avatar Feb 28 '23 11:02 anakin87

Hey sorry for the silence. The CLA bot is really acting up on this PR... I'll reach out to a colleague to see what we can do for it. I'll review in the meantime :slightly_smiling_face:

ZanSara avatar Feb 28 '23 13:02 ZanSara