POS tags ? distinguishing some patterns

Open eroux opened this issue 4 years ago • 2 comments

For a phonetics use case I need to distinguish how བ is pronounced (ba or wa), but this currently seems impossible with botok:

  • རབ་གསལ་བས is tokenized as རབ་གསལ་ - བས (in that case བས is pronounced with the wa sound)
  • བྱང་ཆུབ་བར་དུ is tokenized as བྱང་ཆུབ་ - བར་ - དུ (in that case བར is pronounced bar)

Is there any way I can discriminate between the two with botok (or any other tool)?

eroux avatar Nov 14 '21 15:11 eroux

བར་དུ་ should be added to the vocab. I would argue that it's a frozen expression by now. We'll add instructions on how to do this to the botok docs.

ngawangtrinley avatar Nov 15 '21 03:11 ngawangtrinley

Well, what I'll do with another POS tagger is look at the n.rel tag described at https://web.archive.org/web/20170824153724/http://larkpie.net/tibetancorpus/tags
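
A minimal sketch of that idea, assuming the tagger output is already available as (form, tag) pairs. The function names and the `other` tag are hypothetical placeholders; only `n.rel` comes from the tag set linked above. It does not call botok, and it leaves every non-`n.rel` token undecided rather than guessing the wa sound.

```python
from typing import List, Optional, Tuple

BA = "\u0f56"  # Tibetan letter བ


def ba_sound(form: str, tag: str) -> Optional[str]:
    """Return "ba" for an n.rel token starting with བ, else None (undecided)."""
    if not form.startswith(BA):
        return None
    if tag == "n.rel":
        return "ba"
    # Any other tag needs its own rule; this sketch does not guess.
    return None


def annotate(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str, Optional[str]]]:
    """Attach the pronunciation decision to each (form, tag) pair."""
    return [(form, tag, ba_sound(form, tag)) for form, tag in tokens]


# Hand-written stand-in for tagger output on the second example;
# "other" is a placeholder, not a real tag from the tag set.
tagged = [("བྱང་ཆུབ་", "other"), ("བར་", "n.rel"), ("དུ", "other")]
result = annotate(tagged)
```

Here only བར་ is resolved (to "ba"); the བས case from the first example would need a separate rule keyed on whatever tag the tagger assigns to it.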

eroux avatar Nov 15 '21 08:11 eroux