torchnlp

Using only the encoder part for word accentuation

Open aleksas opened this issue 5 years ago • 4 comments

Would it be possible to use only the Transformer's encoder part to train word accentuation for Lithuanian? In Lithuanian, stress is somewhat tricky, as it can vary depending on context and on word meaning (e.g. grammatical case). You mentioned in your post using only the encoder part for one-to-one mapping. For Lithuanian accentuation, there are three types of accent, and the position of the accent within the word varies a lot. A word can also have no accent at all. Any suggestions?

aleksas avatar Nov 27 '18 10:11 aleksas

So is the accent on particular characters? You could define tags at the character level and basically work with a character-level Transformer encoder.
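
A rough sketch of what a character-level encoder-only tagger could look like. Note that this uses PyTorch's built-in `nn.TransformerEncoder` rather than the Transformer module in this repository, and that the class name `CharTagger`, the hyperparameters, the learned positional embedding, and the four-tag output (three accent types plus none) are all assumptions made for illustration:

```python
import torch
from torch import nn


class CharTagger(nn.Module):
    """Hypothetical encoder-only tagger: one tag prediction per input character."""

    def __init__(self, vocab_size, num_tags, d_model=128, nhead=4,
                 num_layers=2, max_len=512, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.char_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        # Learned positional embedding (an assumption; sinusoidal also works).
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character ids, padded with pad_id.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_emb(char_ids) + self.pos_emb(positions)
        # True entries in the mask mark padding positions to be ignored.
        hidden = self.encoder(x, src_key_padding_mask=char_ids.eq(self.pad_id))
        return self.out(hidden)  # (batch, seq_len, num_tags)


model = CharTagger(vocab_size=384, num_tags=4)
char_ids = torch.tensor([[108, 97, 98, 97, 115],   # "labas"
                         [106, 105, 0, 0, 0]])     # "ji" plus padding
logits = model(char_ids)
```

Since input and output have the same length, no decoder is needed; training would typically apply a per-character cross-entropy loss to `logits` while skipping padded positions.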

kolloldas avatar Nov 30 '18 23:11 kolloldas

Yes, the accent is on a specific letter. Does the Transformer need a dictionary for character-level tagging? What should my next steps be in order to train a Transformer for accentuation of Lithuanian? I have a dataset of ~13K sentences with accentuation. I suspect it may not be enough to train a Transformer, but I'm very keen to try...

aleksas avatar Dec 02 '18 14:12 aleksas

I think you can map the input directly to the Unicode character values. The infrastructure around the Tagger classes currently works at a word (+char) level. We'll have to make it more generic to handle character-only input (an incentive for me to work on this!).
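
As a minimal sketch of that mapping, assuming plain Python and no dictionary file: each character of the unaccented input becomes its Unicode code point. The names `encode` and `pad_batch`, the reserved ids, and the cutoff at U+0180 (which covers Basic Latin through Latin Extended-A, including the Lithuanian letters) are illustrative assumptions, not part of the library:

```python
import unicodedata

PAD_ID = 0          # U+0000 does not occur in normal text, so it can mark padding
UNK_ID = 1          # assumed id for any character outside the chosen range
VOCAB_SIZE = 0x180  # assumed cutoff: code points below U+0180


def encode(text):
    """Map each character to its Unicode code point (no vocabulary file needed)."""
    # NFC keeps letters such as "č" or "ū" as single precomposed code points.
    text = unicodedata.normalize("NFC", text)
    return [ord(ch) if UNK_ID < ord(ch) < VOCAB_SIZE else UNK_ID for ch in text]


def pad_batch(sequences):
    """Right-pad every id sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (width - len(seq)) for seq in sequences]


batch = pad_batch([encode("ačiū"), encode("labas")])
```

Capping the id range keeps the embedding table small; anything outside it collapses to the single unknown id.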

But the Transformer module is independent of the input (check this file).

13K sentences should be more than enough if you're working at a character level. Do you have tags for each character (including none)?
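
For illustration, per-character tags (including none) could be derived from accented text roughly as follows. This sketch assumes the stress marks in the dataset are the combining grave, acute, and tilde; the tag names and the function `split_accents` are made up for the example:

```python
import unicodedata

# Assumption: stress is written with one of these three combining marks.
ACCENT_TAGS = {"\u0300": "GRAVE", "\u0301": "ACUTE", "\u0303": "TILDE"}


def split_accents(accented):
    """Return (plain characters, one tag per character) for an accented string."""
    letters, tags = [], []
    # NFD splits every letter into its base character plus combining marks.
    for ch in unicodedata.normalize("NFD", accented):
        if ch in ACCENT_TAGS and letters:
            tags[-1] = ACCENT_TAGS[ch]   # stress mark becomes the letter's tag
        elif unicodedata.combining(ch) and letters:
            letters[-1] += ch            # keep ogonek, caron, dot, macron
        else:
            letters.append(ch)
            tags.append("NONE")
    # Recompose so that letters such as "ė" are single characters again.
    plain = [unicodedata.normalize("NFC", letter) for letter in letters]
    return plain, tags


plain, tags = split_accents("la\u0303bas")
```

The plain characters are the model input and the tags are the targets, so both sequences always have the same length.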

kolloldas avatar Dec 04 '18 04:12 kolloldas

I do have tags for each char.

aleksas avatar Dec 04 '18 07:12 aleksas