Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

Results 31 Botok issues

Please complete the help documentation soon. This is important for more developers to discover and use the project!

Hopefully this solves #76. It also includes some code cleanups.

Hi, when sentence tokenizing Tibetan text with English (or Non-Tibetan?) words at the end of the text, the ending English part is missing from the results of sentence tokenization. Another...
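One way to catch this kind of loss in a regression test is to compare the input with the sentences that come back. The helper below is a hypothetical sketch and does not use botok: it assumes the sentence tokenizer's output has already been reduced to a list of plain strings, and the name `dropped_text` is invented for illustration.

```python
def dropped_text(original, sentences):
    """Return the non-blank parts of `original` that no sentence covers.

    Hypothetical helper: `sentences` is assumed to be a list of plain
    strings that appear in `original` in the same order.
    """
    missing = []
    pos = 0
    for sent in sentences:
        idx = original.find(sent, pos)
        if idx == -1:
            raise ValueError(f"sentence not found in order: {sent!r}")
        gap = original[pos:idx]
        if gap.strip():  # ignore gaps that are only whitespace
            missing.append(gap)
        pos = idx + len(sent)
    tail = original[pos:]
    if tail.strip():  # text after the last sentence, e.g. trailing English
        missing.append(tail)
    return missing
```

A non-empty result for text ending in English words would reproduce the report above in test form.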

Lemmatization tests on Travis CI succeeded on [Windows](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823626) and [Linux](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823628), but failed on [macOS](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823627). The results were the same on [Azure Pipelines](https://dev.azure.com/blkserene/BLKSerene%20-%20Github/_build/results?buildId=278&view=results), but [AppVeyor](https://ci.appveyor.com/project/BLKSerene/wordless/builds/35064316) passed. botok version: 0.8.1

མངས་བས་ should be split as མངས་བ/n.v.past + ས་/case.agn (POS added for illustration), but botok splits it as མང + ས་བ + ས་, which seems odd. ས་བ exists as a...

https://github.com/OpenPecha/Botok/blob/master/botok/resources/bo_uni_table.csv

Reproduce script:

```python
from botok import WordTokenizer

wt = WordTokenizer()  # assumed setup, not shown in the report
tokens = wt.tokenize("རིན་ཆེན་མིའི")
print(tokens)
```

Output:

```
[text: "རིན་ཆེན་"
text_cleaned: "རིན་ཆེན་"
text_unaffixed: "རིན་ཆེན་"
syls: ["རིན", "ཆེན"]
pos: OTHER
lemma: རིན་ཆེན་
senses: | pos: OTHER, freq: 22841, affixed:...
```

System:
- botok: v0.8.8

Reproduce:

```python
from botok import WordTokenizer

wt = WordTokenizer()  # assumed setup, not shown in the report
tokens = wt.tokenize("༄༅། །བློ་སྦྱོང་དོན་?")
print(tokens[0])
```

Output:

```
text: "༄༅། །"
char_types: |NORMAL_PUNCT|NORMAL_PUNCT|NORMAL_PUNCT|TRANSPARENT|NORMAL_PUNCT|
chunk_type: PUNCT
start: 0
len: 5
```

Expected output:

```
text:...
```

The yang jug (secondary suffix) of a non-word syllable cannot be parsed:

```python
syl = "བསྟནད"
sc = SylComponents()
syl = remove_tsekdung(syl)
components = sc.get_parts(syl)
```

`components` is returned as `None`.

- [ ] syllable tokenizer needed
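
As a rough illustration of what such a tokenizer could look like, here is a minimal sketch that is independent of botok. It assumes a syllable is any run of code points in the range U+0F40 to U+0FBC (Tibetan letters, vowel signs and subjoined letters), optionally followed by a tsek (U+0F0B). The function name `split_syllables` is hypothetical, and this is not how botok itself segments text.

```python
import re

# Assumption: one syllable = a run of Tibetan letters, vowel signs and
# subjoined letters (U+0F40..U+0FBC), optionally closed by a tsek (U+0F0B).
# Punctuation, spaces and non-Tibetan characters are skipped.
SYLLABLE = re.compile("([\u0f40-\u0fbc]+)\u0f0b?")


def split_syllables(text):
    """Return the syllables of `text` without their trailing tsek."""
    return SYLLABLE.findall(text)


# Example call on a three-syllable string
print(split_syllables("རིན་ཆེན་མིའི"))
```

The range is deliberately coarse. A production tokenizer would also need to decide how to treat digits, Sanskrit-style stacks, and affixed particles.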