add code-mixed language identifier
What?
In this PR, I have added support for identifying code-mixed and Indian languages written in Roman script. Currently, it can detect Hinglish, Tanglish, and Manglish, as well as Hindi, Tamil, and Malayalam written in Roman script.
Related issues
Solves #76, #54
Why?
In this toolkit, support has been provided for identifying languages written in their native scripts. However, if an Indian language is written in Roman script, it would be predicted as 'en' or English. That's why this feature might be helpful. Moreover, the toolkit currently has no way of detecting whether an input is code-mixed, or of identifying which code-mixed language it is.
How?
- At first, I created a dataset of English, Hinglish, Tanglish, and Manglish sentences from the Dravidian Codemix datasets, the HinglishNorm dataset, and Wikipedia sentences.
- Then, I finetuned the indic-bert model on the dataset for a classification task using fastai, to maintain coherency with the pre-trained models of this repository. The inference learner is then exported and uploaded to Dropbox.
- Similar to other pre-trained models, I have added the functionality to download the model from Dropbox and use it for prediction in the `download_assets.py` file. The downloaded learner is saved in the `codemixed` folder within `models`.
- The file `codemixed_util.py` contains the necessary classes for the learner to run. These classes need to be imported while running the code.
- I have also added an argument `check_codemixed` to the `identify_language` function. When set to `False`, it returns 'en' or English if the input is in Latin script. When set to `True`, it executes the `identify_codemixed` function to detect code-mixed instances in the input.
- Adding this functionality also adds a dependency on the `transformers` library.
Testing?
I have written some unit tests for this functionality. You can check the unit tests and their output in this GitHub Gist. Apart from that, I also ran the other tests to make sure that no dependencies get broken and no other functionality fails.
Example code
Refer to this gist for example code for this functionality.
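In case the gist is unavailable, here is a minimal sketch of the intended usage; the input sentences, the assumed default of `check_codemixed`, and the exact return labels are illustrative assumptions, not verified output:

```python
from inltk.codemixed_util import *  # classes needed by the exported learner (see Concerns below)
from inltk.inltk import identify_language

# With check_codemixed=False (assumed to be the default), Latin-script
# input is reported as English
identify_language('tumhara naam kya hai')  # -> 'en'

# With check_codemixed=True, identify_codemixed runs the classifier instead;
# the returned label is expected to be a code-mixed tag (e.g. Hinglish here)
identify_language('tumhara naam kya hai', check_codemixed=True)
```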
Concerns
- One major concern is that the classes from `codemixed_util.py` need to be imported before running the code-mixed identifier; otherwise, it will raise an `AttributeError` (see the sketch below).
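A short illustration of this pitfall, assuming `codemixed_util.py` sits inside the `inltk` package (the import path is an assumption):

```python
from inltk.inltk import identify_language

# Without the following import, unpickling the exported learner fails with an
# AttributeError, because fastai cannot resolve the custom classes (e.g. the
# transformer wrapper) referenced inside the pickle. It must run before the
# first prediction:
from inltk.codemixed_util import *

identify_language('ungaluku theriyuma', check_codemixed=True)
```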
Anything Else?
For more insight into the dataset creation and classification model training, check this repository.
Thanks a lot @tathagata-raha for your contribution. Your work looks great; I just had a few comments:
- It'll be great if you can also add documentation for this functionality in the docs.
- I see that you're using the `transformers` library, but I don't see it added in the dependencies; you'll need to add that as well, right? Let me know if I'm missing something here.
- Have you checked how `transformers`' torch dependency and iNLTK's torch dependency will work out together? Some of the pretrained models in iNLTK give errors with torch >= 1.4 (for example, see this issue). I hope you've checked this?
- I didn't get the concern: what implications does it have, and how does it affect the end user?
Again, thanks for the great work, and apologies for the delay in replying. Let me know once you've clarified the above; I'll do a final round of testing and then we should be good to release.
@tathagata-raha Thanks for your work! I cloned the code from your branch here, installed it with pip, and tried a simple `identify_language` call. I got an error as follows:
```
Traceback (most recent call last):
  File "/home/dmitry/Projects/inltk_detect/inltk_detect.py", line 33, in <module>
    check_codemixed=True)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/inltk/inltk.py", line 94, in identify_language
    return identify_codemixed(input)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/inltk/inltk.py", line 77, in identify_codemixed
    learn = load_learner(path / 'models' / 'codemixed')
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/fastai/basic_train.py", line 619, in load_learner
    state = torch.load(source, map_location='cpu') if defaults.device == torch.device('cpu') else torch.load(source)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/torch/serialization.py", line 613, in _load
    result = unpickler.load()
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/transformers/tokenization_albert.py", line 168, in __setstate__
    self.sp_model.Load(self.vocab_file)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "/home/tathagata.raha/.cache/torch/transformers/960f997efd899fcbfa4733fe76d0c5f6382ca387918643930c819e7ce5b54e99.3bbebe118dc3b5069a335a9507a5cf87e9418c4e97110395c9f0043eba6e095b": No such file or directory Error #2

Process finished with exit code 1
```
As you can see, it's a reference to your local home directory somewhere (though not in the code). Any ideas on how to fix it?
@tathagata-raha I've done some investigation, and I suppose it's related to the way the model was saved. Could you please take a look: https://github.com/huggingface/transformers/issues/5292. Can I ask you to re-save the model, so it would be possible to use it in a different environment?
@tathagata-raha
I figured out the cause of the issue. Posting it here in case you are interested, or maybe someone else will need it.
Fast.ai saves and loads a transformer using `torch.save` and `torch.load`, but this causes the error described above: the model gets saved with local cache paths (a transformer is not supposed to be saved that way). To solve this, it's necessary to save the transformer with `save_pretrained` and load it separately from the fast.ai learner.
Saving:
```python
# minimal data for inference
databunch.export("data.pkl")
...
# saving trained transformer
learner.model.transformer.save_pretrained("transformer")
```
Loading:
Instead of `load_learner(path / 'models' / 'codemixed')` in the `identify_codemixed` function, we can do something like this:

```python
from fastai.basic_data import DataBunch
from fastai.basic_train import Learner
from transformers import AdamW, AutoConfig, AutoModelForSequenceClassification
from inltk.codemixed_util import CustomTransformerModel  # assumed import path for the wrapper class

# model_path would be path / 'models' / 'codemixed', as in identify_codemixed
pretrained_name = 'ai4bharat/indic-bert'
config = AutoConfig.from_pretrained(pretrained_name)
transformer_model = AutoModelForSequenceClassification.from_config(config)
custom_transformer_model = CustomTransformerModel(transformer_model=transformer_model)
databunch = DataBunch.load_empty(path=model_path, fname="data.pkl")
# creating a dummy learner
learner = Learner(databunch, custom_transformer_model, opt_func=AdamW)
# load the pretrained transformer saved with save_pretrained
learner.model.transformer = AutoModelForSequenceClassification.from_pretrained(model_path / "transformer")
```

Maybe there is a more elegant way to do it, but at least it works.
> 3. Have you checked how `transformers`' torch dependency and iNLTK's torch dependency will work out together? Some of the pretrained models in iNLTK give errors with torch >= 1.4 (for example, see this issue). I hope you've checked this?

I tried it, and `transformers` works with iNLTK's torch dependency just fine, as long as I'm using `transformers==3.5.1`, not a 4+ version.
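For concreteness, a hypothetical sketch of how this combination could be pinned in setup.py; the torch upper bound is my assumption based on the torch >= 1.4 issue mentioned earlier, not a tested constraint:

```python
# hypothetical setup.py pins reflecting the combination reported to work;
# the torch upper bound is an assumption based on the torch >= 1.4 issue
install_requires = [
    'transformers==3.5.1',
    'torch<1.4',
    # ... plus iNLTK's existing dependencies (fastai, etc.) ...
]
```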