Botok issues

Please update the help documentation!

Please complete the help documentation. This matters for more developers to discover and use the project!

Tshor

Make error handling more robust when downloading dialect packs

Hopefully this solves #76. It also includes some code cleanups.
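One common way to make a download step tolerate transient network failures is to retry with exponential backoff. The sketch below is not Botok's code: `download_with_retry`, its parameters, and the injectable `fetch` callable are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request


def fetch_url(url, timeout=30):
    """Fetch raw bytes from a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_retry(url, fetch=fetch_url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff.

    Hypothetical helper for illustration; not part of Botok's API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError, ConnectionError) as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x ... before the next try.
                sleep(base_delay * 2 ** attempt)
    # Surface one clear error instead of a raw network exception.
    raise RuntimeError(f"download failed after {attempts} attempts: {url}") from last_error
```

Passing `fetch` and `sleep` as arguments keeps the retry logic testable on CI without touching the network.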

BLKSerene

Missing English words at the end of the text during sentence tokenization

Hi, when sentence-tokenizing Tibetan text that ends with English (or non-Tibetan?) words, the trailing English part is missing from the sentence tokenization results. Another...
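A cheap regression check for this kind of bug is to verify that joining the returned sentences reproduces the input exactly. The sketch below uses a deliberately naive splitter that breaks after each run of shad marks (།); it is not Botok's sentence tokenizer, and `split_sentences` and `covers_input` are hypothetical names used only for illustration.

```python
import re

# Either: any text up to and including a run of shad marks plus trailing
# whitespace, or: whatever text remains when no further shad follows.
_SENTENCE = re.compile(r"[^།]*།+\s*|[^།]+")


def split_sentences(text):
    """Naively split text after each run of shad marks, keeping every character."""
    return _SENTENCE.findall(text)


def covers_input(text, sentences):
    """Return True if the sentences, joined back together, equal the input."""
    return "".join(sentences) == text


text = "བཀྲ་ཤིས་བདེ་ལེགས། This is English"
sentences = split_sentences(text)
print(sentences)
print(covers_input(text, sentences))
```

The same `covers_input` check could be applied to the output of any sentence tokenizer whose results can be mapped back to strings.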

BLKSerene

Download of dialect packs fails on macOS when running CI

Lemmatization tests on Travis CI succeeded on [Windows](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823626) and [Linux](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823628), but failed on [macOS](https://travis-ci.com/github/BLKSerene/Wordless/jobs/381823627). The results were the same on [Azure Pipelines](https://dev.azure.com/blkserene/BLKSerene%20-%20Github/_build/results?buildId=278&view=results), but the [AppVeyor](https://ci.appveyor.com/project/BLKSerene/wordless/builds/35064316) build passed. botok version: 0.8.1

BLKSerene

Splitting མངས་བས་ wrong?

མངས་བས་ should be split as: མངས་བ/n.v.past + ས་/case.agn (POS added for illustration), but botok somehow splits it as མང + ས་བ + ས་ which seems odd? ས་བ exists as a...

lothelanor

[Feature] Classify all PUNCTs into left and right

https://github.com/OpenPecha/Botok/blob/master/botok/resources/bo_uni_table.csv
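If "left" and "right" here mean opening and closing punctuation (an assumption; the issue body only links the table), one possible starting point is the Unicode general category, which already marks paired Tibetan brackets such as ༺ ༻ and ༼ ༽ as opening or closing. The helper name `punct_side` is hypothetical and this is a sketch, not Botok's classification.

```python
import unicodedata


def punct_side(char):
    """Return 'left', 'right', or None based on the Unicode general category.

    Ps/Pi (open, initial quote) are treated as left;
    Pe/Pf (close, final quote) are treated as right.
    """
    category = unicodedata.category(char)
    if category in ("Ps", "Pi"):
        return "left"
    if category in ("Pe", "Pf"):
        return "right"
    return None


for mark in "༺༻༼༽།":
    print(mark, punct_side(mark))
```

Marks such as the shad (།) fall into the general "other punctuation" category, so any left/right assignment for them would need to be added by hand.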

10zinten

`token.text_unaffixed` failed to add tsek

Reproduce script:

```python
tokens = wt.tokenize("རིན་ཆེན་མིའི")
print(tokens)
```

output:

```
[text: "རིན་ཆེན་"
text_cleaned: "རིན་ཆེན་"
text_unaffixed: "རིན་ཆེན་"
syls: ["རིན", "ཆེན"]
pos: OTHER
lemma: རིན་ཆེན་
senses: | pos: OTHER, freq: 22841, affixed:...
```
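The behavior the title asks for, an unaffixed form that always ends in a tsek, can be illustrated with a small normalizing helper. This is a sketch under that assumption only: `ensure_tsek` is a hypothetical name, not a Botok function, and it does not show where in Botok the tsek is dropped.

```python
TSEK = "\u0f0b"  # ་ TIBETAN MARK INTERSYLLABIC TSHEG


def ensure_tsek(text):
    """Return text ending in exactly one tsek; empty input is returned unchanged."""
    if not text:
        return text
    # Strip any existing trailing tseks first so none are doubled.
    return text.rstrip(TSEK) + TSEK


print(ensure_tsek("མི"))
print(ensure_tsek("རིན་ཆེན་"))
```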

10zinten

Missing pos for PUNCT

10zinten

syllable component

The yang jug of a non-word syllable cannot be parsed. With `syl = "བསྟནད"`, `sc = SylComponents()`, `syl = remove_tsekdung(syl)` and `components = sc.get_parts(syl)`, `components` is returned as `None`.

kaldan007

syllable tokenizer request

- [ ] syllable tokenizer needed
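As a rough idea of what a minimal syllable tokenizer could look like, the sketch below splits on the tsek (་) with a regular expression and keeps each tsek attached to the syllable it closes. `split_syllables` is a hypothetical name, not an existing Botok function, and the rule ignores punctuation such as the shad, which a real tokenizer would need to handle.

```python
import re

TSEK = "\u0f0b"  # ་ TIBETAN MARK INTERSYLLABIC TSHEG

# A syllable is a run of non-tsek, non-whitespace characters,
# optionally followed by the tsek that closes it.
_SYLLABLE = re.compile(rf"[^{TSEK}\s]+{TSEK}?")


def split_syllables(text):
    """Split text into syllables, keeping each trailing tsek with its syllable."""
    return _SYLLABLE.findall(text)


print(split_syllables("རིན་ཆེན་མིའི"))
```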

ta4tsering
