POS tagging

Open TviNet opened this issue 5 years ago • 4 comments

https://universaldependencies.org/ has labelled data for parts of speech, dependencies, and morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu. I plan on using an LM-LSTM-CRF architecture for sequence tagging. However, the language models in iNLTK use SentencePiece tokens. Could anyone guide me on using the existing LM with word tokens, or do I need to retrain the embeddings for word tokens?

TviNet, May 22 '19 08:05

@TviNet Thanks for reaching out! I glanced over the LM-LSTM-CRF repo and saw that it treats every space-separated word as a token. I think you can do that for Indic languages as well, but in that case you might not be able to use transfer learning (that is, reuse the pretrained LMs). I might be wrong here and need to dig deeper into the repo, but that's my impression from a quick glance.

The way I was thinking of tackling POS is to use transfer learning with some pre-processing of the dataset: break down every word into its tokens (using the tokenizer we have in iNLTK) and expand the word's tag into <sometag1, sometag2, sometag3>, depending on the number of tokens the word gets broken down into. I think this will yield a better model and better results, but we should experiment.
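A rough sketch of that pre-processing step, in case it helps. The thread does not say what the per-token tags should be, so this assumes the simplest scheme: every subword piece inherits its word's tag. `expand_tags` and `toy_tokenize` are hypothetical names; `toy_tokenize` is only a stand-in for the real iNLTK/SentencePiece tokenizer, which you would pass in instead.

```python
from typing import Callable, List, Sequence, Tuple


def expand_tags(
    words: Sequence[str],
    tags: Sequence[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Split each word into subword pieces and copy the word's tag onto each piece.

    Assumption: repeating the tag is just one possible labelling scheme.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    pieces: List[str] = []
    piece_tags: List[str] = []
    for word, tag in zip(words, tags):
        subtokens = tokenize(word)
        pieces.extend(subtokens)
        piece_tags.extend([tag] * len(subtokens))
    return pieces, piece_tags


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks of up to 3 characters,
    with a SentencePiece-style word-start marker on the first chunk."""
    chunks = [word[i:i + 3] for i in range(0, len(word), 3)] or [""]
    chunks[0] = "\u2581" + chunks[0]
    return chunks


pieces, piece_tags = expand_tags(["running", "fast"], ["VERB", "ADV"], toy_tokenize)
print(list(zip(pieces, piece_tags)))
```

At prediction time the piece-level tags would have to be merged back into one tag per word (for example by taking the tag of the first piece), which is another choice worth experimenting with.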

Let me know what your thoughts are. Thanks!

goru001, May 23 '19 03:05

I tried averaging the subtoken embeddings of each word and then feeding them to an LSTM+CRF, which gave decent results for Hindi (13k training sentences, 96.3% accuracy) but not for Tamil (400 training sentences, 87% accuracy). The other languages similarly have very few training samples.
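For anyone following along, the averaging step can be sketched roughly as below. This is not the exact code used for the numbers above; it assumes you already have one embedding vector per subtoken and know how many subtokens each word was split into. `average_subtoken_vectors` is a hypothetical helper name, and plain Python lists are used to keep the sketch dependency-free.

```python
from typing import List, Sequence


def average_subtoken_vectors(
    vectors: Sequence[Sequence[float]],
    word_lengths: Sequence[int],
) -> List[List[float]]:
    """Mean-pool consecutive subtoken vectors into one vector per word.

    vectors:      one embedding per subtoken, in sentence order
    word_lengths: number of subtokens belonging to each word
    """
    if sum(word_lengths) != len(vectors):
        raise ValueError("word_lengths must account for every subtoken vector")
    word_vectors: List[List[float]] = []
    start = 0
    for n in word_lengths:
        if n <= 0:
            raise ValueError("every word needs at least one subtoken")
        group = vectors[start:start + n]
        dim = len(group[0])
        # Average each embedding dimension over the word's subtokens.
        word_vectors.append([sum(v[d] for v in group) / n for d in range(dim)])
        start += n
    return word_vectors


# Three subtokens: the first two belong to word 1, the last one to word 2.
subtoken_vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_subtoken_vectors(subtoken_vecs, [2, 1]))
```

The resulting word-level vectors line up one-to-one with the word-level POS tags, so they can be fed to a word-level LSTM+CRF tagger without changing the labels.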

TviNet, May 23 '19 19:05

Yes, that's why I think using transfer learning is important here, especially for low-resource languages.

goru001, May 25 '19 06:05

Hi,

In case you are interested in a BiLSTM-based Tamil POS tagger (developed using the Stanza framework), see https://github.com/sarves/thamizhi-pos for the relevant models and tagged data.

Sarves

sarves, Jan 11 '21 14:01