add code-mixed language identifier
What?
In this PR, I have added support for identifying code-mixed and Indian languages written in Roman script. Currently, it can detect Hinglish, Tanglish, and Manglish, as well as Hindi, Tamil, and Malayalam written in Roman script.
Related issues
Solves #76, #54
Why?
In this toolkit, support has been provided for identifying languages written in their native scripts. However, if an Indian language is written in Roman script, it would be predicted as 'en' or English. That's why this feature might be helpful. Moreover, the toolkit currently has no way of detecting whether an input is code-mixed, or of identifying which code-mixed language it is.
How?
- At first, I created a dataset of English, Hinglish, Tanglish, and Manglish sentences from the Dravidian Codemix datasets, the HinglishNorm dataset, and Wikipedia sentences.
- Then, I finetuned the indic-bert model on the dataset for a classification task using fastai, to maintain coherency with the pre-trained models of this repository. The inference learner is then exported and uploaded to Dropbox.
- Similar to other pre-trained models, I have added the functionality to download the model from Dropbox and use it for prediction in the `download_assets.py` file. The downloaded learner is saved in the `codemixed` folder within `models`.
- The file `codemixed_util.py` contains the necessary classes for the learner to run. These classes need to be imported while running the code.
- I have also added an argument `check_codemixed` to the `identify_language` function. When set to `False`, it returns 'en' or English if the input is in Latin script. When set to `True`, it executes the `identify_codemixed` function to detect code-mixed instances in the input.
- Adding this functionality also adds a dependency on the `transformers` library.
Testing?
I have written some unit tests for this functionality. You can check the unit tests and their output in this GitHub Gist. Apart from that, I also ran the other tests to make sure that no dependencies get broken and no other functionality fails.
Example code
Refer to this gist for example code for this functionality.
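In case the gist is unavailable, here is a minimal sketch of the intended usage; the input sentences, the assumed default of `check_codemixed`, and the exact return labels are illustrative assumptions, not verified output:

```python
from inltk.codemixed_util import *  # classes needed by the exported learner (see Concerns below)
from inltk.inltk import identify_language

# With check_codemixed=False (assumed to be the default), Latin-script
# input is reported as English
identify_language('tumhara naam kya hai')  # -> 'en'

# With check_codemixed=True, identify_codemixed runs the classifier instead;
# the returned label is expected to be a code-mixed tag (e.g. Hinglish here)
identify_language('tumhara naam kya hai', check_codemixed=True)
```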
Concerns
- One major concern is that the classes from `codemixed_util.py` need to be imported before running the code-mixed identifier; otherwise, it will raise an `AttributeError` (see the sketch below).
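A short illustration of this pitfall, assuming `codemixed_util.py` sits inside the `inltk` package (the import path is an assumption):

```python
from inltk.inltk import identify_language

# Without the following import, unpickling the exported learner fails with an
# AttributeError, because fastai cannot resolve the custom classes (e.g. the
# transformer wrapper) referenced inside the pickle. It must run before the
# first prediction:
from inltk.codemixed_util import *

identify_language('ungaluku theriyuma', check_codemixed=True)
```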
Anything Else?
For more insight into the dataset creation and classification model training, check this repository.
Thanks a lot @tathagata-raha for your contribution. Your work looks great; I just had a few comments:
- It'll be great if you can also add documentation for this functionality in the docs.
- I see that you're using the `transformers` library, but I don't see it added in the dependencies; you'll need to add that as well, right? Let me know if I'm missing something here.
- Have you checked how `transformers`' torch dependency and iNLTK's torch dependency will work out together? Some of the pretrained models in iNLTK give errors with torch >= 1.4 (for example, see this issue). I hope you've checked this?
- I didn't get the concern: what implications does it have, and how does it affect the end user?
Again, thanks for the great work, and apologies for the delay in replying. Let me know once you've clarified the above; I'll do a final round of testing and then we should be good to release.
@tathagata-raha Thanks for your work! I cloned the code from your branch here, installed it with pip, and tried a simple `identify_language` call. I got an error as follows:
```
Traceback (most recent call last):
  File "/home/dmitry/Projects/inltk_detect/inltk_detect.py", line 33, in <module>
    check_codemixed=True)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/inltk/inltk.py", line 94, in identify_language
    return identify_codemixed(input)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/inltk/inltk.py", line 77, in identify_codemixed
    learn = load_learner(path / 'models' / 'codemixed')
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/fastai/basic_train.py", line 619, in load_learner
    state = torch.load(source, map_location='cpu') if defaults.device == torch.device('cpu') else torch.load(source)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/torch/serialization.py", line 613, in _load
    result = unpickler.load()
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/transformers/tokenization_albert.py", line 168, in __setstate__
    self.sp_model.Load(self.vocab_file)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/home/dmitry/anaconda3/envs/inltk/lib/python3.6/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "/home/tathagata.raha/.cache/torch/transformers/960f997efd899fcbfa4733fe76d0c5f6382ca387918643930c819e7ce5b54e99.3bbebe118dc3b5069a335a9507a5cf87e9418c4e97110395c9f0043eba6e095b": No such file or directory Error #2

Process finished with exit code 1
```
As you can see, it's a reference to your local home directory somewhere (though not in the code). Any ideas on how to fix it?
@tathagata-raha I've done some investigation, and I suppose it's related to the way the model was saved. Could you please take a look: https://github.com/huggingface/transformers/issues/5292. Can I ask you to re-save the model, so it would be possible to use it in a different environment?
@tathagata-raha
I figured out the cause of the issue. Posting it here in case you are interested, or maybe someone else will need it.
Fast.ai saves and loads a transformer using `torch.save` and `torch.load`, but this causes the error described above: the model gets saved with local cache paths (a transformer is not supposed to be saved that way). To solve this, it's necessary to save the transformer with `save_pretrained` and load it separately from the fast.ai learner.
Saving:
```python
# minimal data for inference
databunch.export("data.pkl")
...
# saving trained transformer
learner.model.transformer.save_pretrained("transformer")
```
Loading:
Instead of `load_learner(path / 'models' / 'codemixed')` in the `identify_codemixed` function, we can do something like this:

```python
from fastai.basic_data import DataBunch
from fastai.basic_train import Learner
from transformers import AdamW, AutoConfig, AutoModelForSequenceClassification
from inltk.codemixed_util import CustomTransformerModel  # assumed import path for the wrapper class

# model_path would be path / 'models' / 'codemixed', as in identify_codemixed
pretrained_name = 'ai4bharat/indic-bert'
config = AutoConfig.from_pretrained(pretrained_name)
transformer_model = AutoModelForSequenceClassification.from_config(config)
custom_transformer_model = CustomTransformerModel(transformer_model=transformer_model)
databunch = DataBunch.load_empty(path=model_path, fname="data.pkl")
# creating a dummy learner
learner = Learner(databunch, custom_transformer_model, opt_func=AdamW)
# load the pretrained transformer saved with save_pretrained
learner.model.transformer = AutoModelForSequenceClassification.from_pretrained(model_path / "transformer")
```

Maybe there is a more elegant way to do it, but at least it works.
> 3. Have you checked how `transformers`' torch dependency and iNLTK's torch dependency will work out together? Some of the pretrained models in iNLTK give errors with torch >= 1.4 (for example, see this issue). I hope you've checked this?

I tried it, and `transformers` works with iNLTK's torch dependency just fine, as long as I'm using `transformers==3.5.1`, not a 4+ version.
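For concreteness, a hypothetical sketch of how this combination could be pinned in setup.py; the torch upper bound is my assumption based on the torch >= 1.4 issue mentioned earlier, not a tested constraint:

```python
# hypothetical setup.py pins reflecting the combination reported to work;
# the torch upper bound is an assumption based on the torch >= 1.4 issue
install_requires = [
    'transformers==3.5.1',
    'torch<1.4',
    # ... plus iNLTK's existing dependencies (fastai, etc.) ...
]
```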