inltk
Integrating with HuggingFace Transformer
Hi, could you give me some insight into whether it is possible to plug inltk into the HuggingFace Transformers library?
I was looking for the same. Maybe we can use multilingual transformers. But the question is how to tokenize Indian languages, which have a different structure. Is there any way to break them up for BPE? I am eager to work on this and contribute.
@octalpixel, @parmarsuraj99 Thanks for reaching out. Currently, integrating it with the transformers library isn't straightforward, and may not be possible as things stand. I'd be happy to have contributions from the community to help with it.
So, we just need a tokenizer trained separately on Indian languages, and then we plug it directly into an LM? Maybe a SentencePiece tokenizer trained on Hindi, attached to HuggingFace BERT. Should I go this way?
@parmarsuraj99 Yes, you can use sentencepiece or Huggingface's tokenizers library (https://github.com/huggingface/tokenizers). I've been working on training a BERT Hindi model using the tokenizers and transformers libraries from Huggingface.
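To make the BPE question above concrete, here is a minimal pure-Python sketch of byte-pair-encoding-style merging on Devanagari text. It is a toy, not the sentencepiece or tokenizers implementation: the function names (`learn_bpe`, `merge_pair`, `encode`) and the tiny corpus are hypothetical, and it assumes words are split on whitespace and start out as individual Unicode code points, so vowel signs (matras) are separate symbols until a merge joins them to their consonant.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of Unicode code points, mapped to its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus; a real tokenizer would be trained on a large Hindi corpus.
    corpus = "भारत भारतीय भाषा भाषाएँ हिंदी हिंदी हिंदी".split()
    merges = learn_bpe(corpus, num_merges=5)
    print(merges)
    print(encode("भारतवर्ष", merges))
```

Because the merges are learned from pair frequencies alone, nothing in the procedure is specific to Latin script. For real training, the sentencepiece and tokenizers libraries handle normalization, special tokens, and vocabulary files, which this sketch leaves out.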
@goru001 I am really excited to work on that. I believe a trained Hindi model would transfer well to other regional languages too, since many of them are similar. Really looking forward to it.