Integrating with HuggingFace Transformer

Open octalpixel opened this issue 4 years ago • 5 comments

Hi, could you give me some insight into whether it is possible to plug iNLTK into the HuggingFace Transformers library?

octalpixel avatar Mar 15 '20 08:03 octalpixel

I was looking for the same thing. Maybe we can use multilingual transformers. But the question is how to tokenize Indian languages, which have a different structure. Is there any way to break them up for BPE? I am eager to work on this and contribute.

parmarsuraj99 avatar Apr 09 '20 05:04 parmarsuraj99
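On the BPE question, a minimal pure-Python sketch may help show that byte-pair encoding does not depend on Latin-style spelling: it can start from Unicode code points, so Devanagari consonants and vowel signs are just symbols to be merged by frequency. This is a toy illustration, not iNLTK's or HuggingFace's implementation; the function names `learn_bpe`, `merge_pair`, and `segment`, the `</w>` end-of-word marker, and the tiny corpus are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges; symbols start as single Unicode code points."""
    # Each word is a tuple of code points plus a hypothetical end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Tiny made-up Hindi corpus, for illustration only.
corpus = "भारत भारती भारतीय हिंदी हिंदी भाषा".split()
merges = learn_bpe(corpus, num_merges=10)
print(segment("भारतीय", merges))
```

Working at the code-point level means a vowel sign can be split from its consonant early on, but frequent consonant and vowel-sign pairs are among the first to be merged back together. A production tokenizer would be trained on a far larger corpus.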

@octalpixel, @parmarsuraj99 Thanks for reaching out. Currently, it isn't straightforward (or even possible) to integrate iNLTK with the transformers library. I'd be happy to have contributions from the community to help with it.

goru001 avatar Apr 10 '20 19:04 goru001

So we just need a tokenizer trained separately on Indian languages, which we can then plug directly into an LM? Maybe a SentencePiece tokenizer for Hindi attached to HuggingFace BERT. Should I go this way?

parmarsuraj99 avatar Apr 11 '20 05:04 parmarsuraj99

@parmarsuraj99 Yes, you can use SentencePiece or HuggingFace's tokenizers library (https://github.com/huggingface/tokenizers). I've been working on training a Hindi BERT model using the tokenizers and transformers libraries from HuggingFace.

goru001 avatar Apr 11 '20 07:04 goru001
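As a rough illustration of the tokenizers route, the sketch below trains a small WordPiece vocabulary (the subword scheme BERT uses) on a few Hindi sentences. It is not the setup used for the Hindi BERT model mentioned above: the corpus, the vocabulary size, and the choice of `WhitespaceSplit` are assumptions for the example, and it relies on `train_from_iterator`, which is only available in more recent releases of the tokenizers library.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordPieceTrainer

# Tiny made-up corpus; a real tokenizer needs a large Hindi text dump.
corpus = [
    "हिंदी भारत की भाषा है",
    "भारत एक देश है",
    "यह भाषा हिंदी है",
]

# BERT-style special tokens (an assumption about the downstream model).
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Split on whitespace only, so combining marks stay attached to their words.
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordPieceTrainer(
    vocab_size=200,  # illustrative; real vocabularies are far larger
    special_tokens=special_tokens,
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("हिंदी भारत की भाषा है")
print(encoding.tokens)
print(encoding.ids)
```

No normalizer is set here on purpose: BERT-style accent stripping removes nonspacing combining marks, which would drop signs such as the anusvara and virama from Devanagari text. A tokenizer trained this way can, as far as I know, be wrapped for the transformers library with `PreTrainedTokenizerFast(tokenizer_object=tokenizer)`.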

@goru001 I am really excited to work on that. I believe a trained Hindi model would be quite effective at picking up other regional languages as well, as most of them are similar. Really looking forward to it.

parmarsuraj99 avatar Apr 12 '20 12:04 parmarsuraj99