feat: `LanguageClassifier`
Related Issues
- Discussed in https://github.com/deepset-ai/haystack/discussions/2972
Proposed Changes:
- Add a `LanguageClassifier` node that can be used to classify documents by language.
- Introduces a dependency on `langdetect` and therefore its dedicated dependency group.
How did you test it?
TODO
Notes for the reviewer
N/A
Checklist
- [x] I have read the contributors guidelines and the code of conduct
- [x] I have updated the related issue with new insights and changes
- [ ] I added tests that demonstrate the correct behavior of the change
- [x] I've used the conventional commit convention for my PR title
- [ ] I documented my code
- [x] I ran pre-commit hooks and fixed any issue
Hi Sara, I was trying to "allow edits" and I followed the instructions, but I am unable to do it. Should I be able to if you made the Pull Request? (You are the creator of the PR, aren't you?)
Hello @stazam yes I was able to open a Pull Request (this page), but I am unable to push commits to your fork, so I can't add the corrections I need to make. The corrections need to be added to your fork in order to show up here.
You can give me access to your fork in other ways if that one didn't work :blush: Try this: https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-access-to-your-personal-repositories/inviting-collaborators-to-a-personal-repository
Ok, I hope you got the invitation :)
Yes I can push now! Thank you :blush:
Note for the assignee/reviewer: verify if this use case can be covered by DocumentClassifier with a language classification model before proceeding.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Stanislav Zamecnik seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you signed the CLA already, but the status is still pending? Let us recheck it.
I already signed it :)
Ok let me see what I can do for the CLA: the bot seems a bit confused :smile:
Any action necessary from my side? :)
> Note for the assignee/reviewer: verify if this use case can be covered by DocumentClassifier with a language classification model before proceeding.
@ZanSara I'll add my two cents, because I have some experience in language identification:
- There are some transformer models which perform this task, such as `papluca/xlm-roberta-base-language-detection`. From a quick search online, though, language identification seems rarely pursued using transformers.
- Some time ago, I found a recent and interesting blog post on a comparison of language identification models (not transformers). It compares several solutions, including Langdetect and fastText. I think that, if we discard the idea of using transformers for this step, there may also be solutions with better accuracy and more supported languages than Langdetect.
@anakin87 So what about giving users the option to choose their own model (with some default set up), like in many other nodes?
I don't mean to put any pressure on anyone, but is there a particular reason that this PR stalled when it seems so close to being complete? I would love to have such a feature so that I could receive queries in any language and route them to the appropriate translation model/pipeline.
@anakin87 I, also, have the impression that transformers tend not to pre-detect languages, and fasttext does seem to be the ideal solution here. But, to @stazam's point, I think it would be ideal if we could choose our own detection method (be it a particular python package or a transformer-based method). Fasttext could be the default for the parameter.
Is there anything I could do to help move this PR along?
Edit: I've done a bit of work and gotten fasttext to work as the detection package. See the below comments for details.
Further work could be done later to make it customizable, but this should be more than fine for now.
p.s. When I said I got fasttext to work in my previous comment, I meant that in only the most literal sense - that if I call _detect_language() directly, it worked.
I just started trying to implement this DocumentLanguageClassifier node in a pipeline and found it to be far too confusing - it wants to work with documents, which doesn't really work how I need it to.
I just want to determine the language of the initial query, and then feed that to the appropriate TransformersTranslator model, and process it with a pipeline.
For now, I'm just using fasttext's detect() directly in my application's endpoint and routing it from there. But it would be nice if this node could handle queries as well.
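The query-routing setup described above can be sketched in a few lines. The detector is stubbed here (a real setup would call fasttext's or langdetect's detection on the query), and the per-language pipeline registry is hypothetical:

```python
from typing import Callable, Dict


def route_query(query: str, detect: Callable[[str], str],
                pipelines: Dict[str, object], default_lang: str = "en"):
    """Send the query to the pipeline registered for its detected language,
    falling back to a default when the language has no dedicated pipeline."""
    lang = detect(query)
    pipeline = pipelines.get(lang, pipelines[default_lang])
    return pipeline, lang
```

This keeps the detection backend swappable: any callable mapping text to a language code works.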
@nickchomey The node is supposed to work with Document objects; that is why the name "DocumentClassifier" 😃 Actually, I am using the node with the `langdetect` library, and it works perfectly in my pipeline.
Sorry everyone, I've totally lost track of this PR and I've been on leave for a few days already. I'm back now :smile:
So let's summarize what's up.
- This node is designed to work with documents, according to @stazam (and the original thread). @nickchomey did you use this node for queries? Does it work? Can we easily generalize it to make it clearer what it can be used for?
- I fully agree that being able to choose the language detection library is great, but I wouldn't make a single node support all of them. We can have subclasses of an abstract node if we want to support different flavors. Something closer to `LangdetectLanguageClassifier`, `TransformersLanguageClassifier`, `FasttextLanguageClassifier`, all inheriting from `BaseLanguageClassifier`.
- This PR is lacking tests. They should definitely be added. @stazam sorry for not highlighting it earlier, I should have pointed it out. If you don't know where to start, have a look at the tests for other nodes, like these for example, and copy the approach. I can help you along if you have issues understanding the code, just let me know. Read about pytest first if you're unfamiliar with it.
- The CI must be green before merging. Please have a look at the contributing guidelines: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md
I'll review this PR once it's in a better shape. Of course let me know if there are blockers, I'll try to help.
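The subclass layout proposed above could be sketched roughly like this. The class names come from the comment; the detection call is stubbed with a trivial placeholder, since a real subclass would delegate to langdetect, transformers, or fasttext:

```python
from abc import ABC, abstractmethod
from typing import List


class BaseLanguageClassifier(ABC):
    """Shared interface; each concrete subclass plugs in one detection backend."""

    @abstractmethod
    def detect(self, text: str) -> str:
        """Return a language code for the given text."""

    def predict(self, texts: List[str]) -> List[str]:
        # common batch behavior lives in the base class
        return [self.detect(t) for t in texts]


class LangdetectLanguageClassifier(BaseLanguageClassifier):
    def detect(self, text: str) -> str:
        # a real implementation would call langdetect.detect(text);
        # stubbed with a trivial heuristic so the sketch is self-contained
        return "en" if text.isascii() else "unknown"
```

Each backend then only overrides `detect`, while batch logic stays in the base class.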
I was not successful using this node for query language detection. I ended up just installing fasttext and calling it directly. Your suggestion for a base class and subclasses seems good to me. I don't think I'd be particularly useful in creating any of it, though.
@ZanSara how can I help push this PR forward? I think even if it's a small feature, it could be useful for many users.
Some little ideas:
- IMHO, `BaseLanguageClassifier` could be a subclass of `BaseDocumentClassifier`. Only the methods `predict` and `predict_batch` should be implemented. WDYT?
- While I said before that `langdetect` is not the best language detection library, I note that it is already a base dependency of Haystack: https://github.com/deepset-ai/haystack/blob/1399681c818c03326b6742c410aca97dc9c96b60/pyproject.toml#L72 Therefore, while fastText could represent a heavyweight dependency (not being developed natively in Python), we could have `LangdetectLanguageClassifier` and `TransformersLanguageClassifier` at low cost.
Hey @anakin87, thanks for keeping this one up!
- We can totally make `BaseLanguageClassifier` a subclass of `BaseDocumentClassifier`, but on the other hand, I believe there is a good use case for language detection of the query. How do you feel about generalizing this class to be able to perform language detection on queries and documents? We can have two separate nodes if needed, one inheriting from `BaseDocumentClassifier`, another from `BaseQueryClassifier` (mind that `BaseQueryClassifier` might use a revamp; it is very basic and possibly unused right now), or a single node inheriting from both classes.
- Let's start with `TransformersLanguageClassifier` and `LangdetectLanguageClassifier`. From there, adding other options should be easy if there happens to be demand for them.

Make sure to merge main into this branch before resuming work here, as a lot has changed in the meantime and merge conflicts could be very painful later.
Hello @ZanSara!
Your idea seems interesting...
I have two doubts:
- I don't have the rights to push to this PR, so I need to open a new one. Am I wrong?
- A deeper doubt is related to #2999. Given the limitations of the current implementation of Pipelines, does it make sense to design a decision node for language detection if in most cases we need to divide documents into consistent language groups before passing them to an indexing pipeline? Maybe I misunderstood the limits of Pipelines... An alternative solution could be to enrich the document with language metadata. I need to know your point of view on this aspect...
@anakin87 Ad 1. No need to open a new PR, I added you as a collaborator. Ad 2. If we say in the documentation that this node is supposed to come before the PreProcessor, then everything should be fine, I guess.
@anakin87
- Have a try and see if you can make it work on @stazam's fork; otherwise, you can fork from his and progress from there.
- Regarding the branching: yes, that's a good question. I believe we can do both: adding the language to the metadata and routing onto different branches, separately. Something like this:
```python
from typing import List, Optional

class BaseLanguageClassifier(BaseDocumentClassifier):
    def __init__(
        self,
        add_lang_to_meta: bool = True,
        route_by_lang: bool = True,
        langs_to_route: Optional[List[str]] = None,  # None -> default to ["english"] inside __init__
    ):
        ...
```
So you can decide whether to route onto different output edges or not, and which languages to route, independently of whether the detected language is being added to the metadata or not. Should not be too much of an overhead. WDYT?
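A rough sketch of how those parameters could interact at run time, as plain functions rather than the actual node. The detector is stubbed (a real node would call a detection library), and the documents are plain dicts and edge names hypothetical, just to illustrate the metadata/routing split:

```python
from typing import Callable, Dict, List, Optional


def classify_and_route(
    documents: List[dict],
    detect: Callable[[str], str],          # e.g. a langdetect-backed callable
    add_lang_to_meta: bool = True,
    route_by_lang: bool = True,
    langs_to_route: Optional[List[str]] = None,
) -> Dict[str, List[dict]]:
    """Optionally tag each document with its detected language, and/or
    group documents onto one output edge per routed language."""
    langs = langs_to_route or ["english"]
    edges: Dict[str, List[dict]] = {f"output_{i + 1}": [] for i in range(len(langs))}
    for doc in documents:
        lang = detect(doc["content"])
        if add_lang_to_meta:
            doc.setdefault("meta", {})["language"] = lang
        if route_by_lang and lang in langs:
            edges[f"output_{langs.index(lang) + 1}"].append(doc)
    return edges
```

Metadata tagging and routing stay independent switches, which is the point of the two parameters.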
Hey @ZanSara! I started to revamp this old and dear PR :smile:
I would like to keep this node simple and flexible.
The issue of routing docs of different languages makes me think...
- the method `run` should not support documents of multiple languages in the same list
- (obviously, calling `run` several times, each time with a different document, is ok)
- about `run_batch` I have mixed feelings... It would be great to make it route mixed documents of multiple languages, but I see that, for example, the `FileClassifier` doesn't support such a situation. And in any case, it would break the original document lists. So I think that supporting this feature in `run_batch` could overcomplicate things...
What do you think about this point?
(another little question: what is the ideal purpose of `run_batch`? It seems to me that sometimes this method implements different behaviours in different nodes)
@ZanSara since our last discussions, a few things have changed and I've been trying to make some decisions on this node...
Proposed Changes:
New nodes for Document Language Classification: `LangdetectDocumentLanguageClassifier` and `TransformersDocumentLanguageClassifier`.
- they store the language in the Documents' metadata
- they can send the Documents to different edges, depending on their language (if the `route_by_language` parameter is `True`)
- :warning: if routing is enabled, mixed languages aren't allowed in the same list of documents (as in similar decision nodes)
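The mixed-language constraint in the last point can be enforced with a small guard when picking the output edge. A minimal sketch with a stubbed detector; the helper name and error messages are illustrative, not the actual implementation:

```python
from typing import Callable, List


def routing_edge(contents: List[str], detect: Callable[[str], str],
                 languages: List[str]) -> str:
    """Return the output edge for a list of documents, refusing mixed languages."""
    detected = {detect(c) for c in contents}
    if len(detected) > 1:
        raise ValueError(
            f"Mixed languages {sorted(detected)} are not allowed in one list "
            "when routing is enabled"
        )
    lang = detected.pop()
    if lang not in languages:
        raise ValueError(f"Language {lang!r} is not among the routed languages {languages}")
    return f"output_{languages.index(lang) + 1}"
```

Failing loudly on mixed input matches what similar decision nodes (e.g. FileClassifier) already do.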
How did you test it?
Several new tests.
Notes for the reviewer
I tried to make an effort to carry this PR forward… I made several assumptions during development, and now it makes some sense...
In any case, let's try together to put it into shape! :muscle:
@ZanSara I've incorporated your feedback and left just one small comment...
News:
- Since some Transformers models do not return language names but generic labels, the user can now specify a mapping, e.g. `labels_to_languages_mapping = {"LABEL_1": "ar", "LABEL_2": "bg", ...}` (the chosen default model returns correct language codes and does not require the user to specify this mapping)
- new/updated tests for this feature
- smaller Transformer model in tests
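Applying such a mapping amounts to a dictionary lookup with a pass-through fallback. A minimal sketch (the mapping below is illustrative; a model's real labels are listed on its model card):

```python
labels_to_languages_mapping = {"LABEL_1": "ar", "LABEL_2": "bg"}


def label_to_language(label: str, mapping: dict) -> str:
    # models that already emit language codes pass through unchanged
    return mapping.get(label, label)
```

The fallback is what lets the default model skip the mapping entirely.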
Ready for the final review! (I see some problems with the CLA...)
@anakin87, @ZanSara I am actually not sure whether it's a problem that I appear not to have signed the CLA, although I already did so a few months ago.
@stazam can you retry to sign it?
@anakin87 I tried many times, but it is not possible (there is no re-sign button or anything). @ZanSara If necessary, I can send the form to your email (I don't want to block this for too long).
@ZanSara please take a look when you can (still some problems with the CLA...)
Hey sorry for the silence. The CLA bot is really acting up on this PR... I'll reach out to a colleague to see what we can do for it. I'll review in the meantime :slightly_smiling_face: