Botok issues

importing a custom dictionary

1

There's some documentation about a "custom dialect pack", which I guess is what I want (I want to use a custom, smaller dictionary). But it's very bare and doesn't give any...

eroux

issue with Python 3.9

When using a script along the lines of

```python
from botok import Text, WordTokenizer

WT = WordTokenizer()
```

Python 3.9 gives

```
  File "/usr/local/lib/python3.9/site-packages/botok/tokenizers/wordtokenizer.py", line 49, in __init__
    config.profile,
AttributeError:...
```

eroux

bug

POS tags? Distinguishing some patterns

2

For a phonetics use case, I need to distinguish the sound of `བ` (`ba` or `wa`), but this currently seems impossible with botok:

- `རབ་གསལ་བས` is tokenized as `རབ་གསལ་...

eroux

identifying weak syllables

1

Again for the purpose of an app on phonetics (not related to Rigpa but following guidelines close to the [Rigpa phonetics guideline](https://www.rigpawiki.org/index.php?title=Rigpa_Phonetic_Guidelines#Weak_Syllables.5B3.5D)), I need to group certain things together, based...

eroux

Unexpected skip of syllable while tokenizing.

Input: "ཁ་སང་དང་ཁ་སང་གི་སྔ་ལོ།"

Output, if དང་ཁ་ is in the remove word list: 'ཁ་སང་ དང་ སང་ གི་ སྔ་ལོ ། '

Expected output: 'ཁ་སང་ དང་ ཁ་སང་ གི་ སྔ་ལོ ། '

kaldan007

Invalid index in merge rule silently produces uncalled for result.

Input:

- str: "ལས་ཞེས་པ་ནི་ལས་བྱེད་པས་ལས་བྱེད་པ་ཡིན་ནོ།། ལས་བྱེད་པས་ལས་མ་བྱེད་པ་མ་ཡིན་ནོ།།"
- rule: ["ལ"] ["ས་"] 2 + []

Output: "ལ ས་ཞེས་པ་ ནི་ལ ས་ བྱེད་པ ས་ ལ ས་བྱེད་པ་ ཡིན་ ནོ །། ལ ས་བྱེད་པ ས་ ལས་མ་ བྱེད་པ་ མ་...

kaldan007

Why does the VOWELS constant only have one vowel?

1

In botok/botok/vars.py, `VOWELS = ["ི"]`. But it should be four (or five), right? Like "ཨེཨིཨུཨོ".

forest-jiang
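
For reference on the VOWELS question above, here is a minimal sketch that lists the four basic Tibetan vowel signs by Unicode code point. The names `BASIC_VOWEL_SIGNS` and `vowel_signs_in` are hypothetical and only for illustration. This is not botok's definition, and it says nothing about why `vars.py` lists a single sign.

```python
import unicodedata

# Hypothetical constant for illustration; NOT botok's actual VOWELS definition.
# The four basic vowel signs: i (U+0F72), u (U+0F74), e (U+0F7A), o (U+0F7C).
BASIC_VOWEL_SIGNS = ["\u0f72", "\u0f74", "\u0f7a", "\u0f7c"]


def vowel_signs_in(text):
    """Return the basic vowel signs found in `text`, in order of appearance."""
    return [ch for ch in text if ch in BASIC_VOWEL_SIGNS]


# Print each sign's code point and its official Unicode character name.
for sign in BASIC_VOWEL_SIGNS:
    print(f"U+{ord(sign):04X}", unicodedata.name(sign))
```

A syllable without any of these signs carries the inherent vowel "a", which may be the "fifth" vowel the question has in mind.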

detect any language

https://pypi.org/project/cldr-language-helpers/

ngawangtrinley

enhancement

dict like `get` method for Token object

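```python
from botok import WordTokenizer

WT = WordTokenizer()
tokens = WT.tokenize(in_str)  # in_str: the input string to tokenize
token = tokens[0]
token.get('lemma')  # proposed dict-like accessor
```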

10zinten

enhancement
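
One way the requested dict-like `get` could behave is sketched below. `SketchToken` is a hypothetical stand-in class, not botok's `Token`, and its attribute names (`text`, `lemma`, `pos`) are assumptions chosen for the example. It simply delegates to `getattr` with a default.

```python
class SketchToken:
    """Hypothetical stand-in for a token object; not botok's Token class."""

    def __init__(self, text, lemma=None, pos=None):
        self.text = text
        self.lemma = lemma
        self.pos = pos

    def get(self, key, default=None):
        # Mirror dict.get: return the attribute if present, else the default.
        return getattr(self, key, default)


token = SketchToken("abc", lemma="ab")
print(token.get("lemma"))
print(token.get("missing", "n/a"))
```

With this approach an attribute that exists but is set to `None` returns `None` rather than the default, which matches how `dict.get` treats a key whose value is `None`.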

multi-threading

6

Currently we're running everything on a single thread. I wonder if there is a straightforward way to provide a wrapper that allows multi-threading (or even distributing) tokenization.

mikkokotila
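
A wrapper of the kind asked about in the multi-threading issue could be sketched with the standard library's `concurrent.futures`. This is only an illustration under assumptions: `tokenize_many` is a hypothetical helper, the whitespace tokenizer stands in for a real tokenizer, and the callable passed in is assumed to be safe to call concurrently, which has not been verified for botok.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_many(texts, tokenize, max_workers=4):
    """Apply `tokenize` to each text using a thread pool.

    Results come back in the same order as `texts`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, texts))


def whitespace_tokenize(text):
    # Stand-in tokenizer for the example; a real tokenizer would go here.
    return text.split()


results = tokenize_many(["a b", "c d e"], whitespace_tokenize)
print(results)
```

For CPU-bound pure-Python work, threads are limited by the global interpreter lock, so `ProcessPoolExecutor`, which offers the same `map` interface, may be the more useful choice, provided the callable and its arguments can be pickled.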
