Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
There's some documentation about a "custom dialect pack", which I guess is what I want (I want to use a custom, smaller dictionary), but it's very bare and doesn't give any...
When using a script along the lines of

```python
from botok import Text, WordTokenizer

WT = WordTokenizer()
```

Python 3.9 gives

```
File "/usr/local/lib/python3.9/site-packages/botok/tokenizers/wordtokenizer.py", line 49, in __init__
    config.profile,
AttributeError:...
```
For a phonetics use case I need to distinguish the sound of `བ` (`ba` or `wa`), but this currently seems impossible with botok:
- `རབ་གསལ་བས` is tokenized as `རབ་གསལ་...
Again for the purpose of an app on phonetics (not related to Rigpa but following guidelines close to the [Rigpa phonetics guideline](https://www.rigpawiki.org/index.php?title=Rigpa_Phonetic_Guidelines#Weak_Syllables.5B3.5D)), I need to group certain things together, based...
- Input: "ཁ་སང་དང་ཁ་སང་གི་སྔ་ལོ།"
- Output: 'ཁ་སང་ དང་ སང་ གི་ སྔ་ལོ ། ' if དང་ཁ་ is in the remove word list
- Expected output: 'ཁ་སང་ དང་ ཁ་སང་ གི་ སྔ་ལོ ། '
Input:
- str: "ལས་ཞེས་པ་ནི་ལས་བྱེད་པས་ལས་བྱེད་པ་ཡིན་ནོ།། ལས་བྱེད་པས་ལས་མ་བྱེད་པ་མ་ཡིན་ནོ།།"
- rule: ["ལ"] ["ས་"] 2 + []

Output: "ལ ས་ཞེས་པ་ ནི་ལ ས་ བྱེད་པ ས་ ལ ས་བྱེད་པ་ ཡིན་ ནོ །། ལ ས་བྱེད་པ ས་ ལས་མ་ བྱེད་པ་ མ་...
In botok/botok/vars.py, `VOWELS = ["ི"]`. But it should contain four (or five) vowels, right? Like "ཨེཨིཨུཨོ".
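To make the question concrete, here is a minimal sketch that extracts the four vowel signs from the example string above using only the standard library. The name `CANDIDATE_VOWELS` is hypothetical and is not botok's definition; whether `VOWELS` in `vars.py` is meant to hold all of them is exactly what the issue asks.

```python
import unicodedata

# ཨ (U+0F68) carrying each of the four vowel signs, written with escapes
# so the code points are unambiguous: e, i, u, o.
SAMPLE = "\u0f68\u0f7a\u0f68\u0f72\u0f68\u0f74\u0f68\u0f7c"

# Hypothetical list for illustration only, not botok's actual VOWELS.
# Vowel signs are nonspacing marks (category "Mn"); the base letter is not.
CANDIDATE_VOWELS = [ch for ch in SAMPLE if unicodedata.category(ch) == "Mn"]

for ch in CANDIDATE_VOWELS:
    # Print the code point and its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```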
```python
from botok import WordTokenizer

in_str = "ཁ་སང་དང་ཁ་སང་གི་སྔ་ལོ།"  # any Tibetan input string

WT = WordTokenizer()
tokens = WT.tokenize(in_str)
token = tokens[0]
token.get('lemma')
```
Currently we're running everything on a single thread. I wonder if there is a straightforward way to provide a wrapper that allows multi-threading (or even distributing) tokenization.
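One possible shape for such a wrapper, as a rough sketch built on the standard library's `concurrent.futures`. The function `tokenize_parallel` and its parameters are hypothetical and not part of botok; it takes any callable, so the demo uses `str.split` as a stand-in tokenizer.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_parallel(texts, tokenize, max_workers=4,
                      executor_cls=ThreadPoolExecutor):
    """Apply `tokenize` to every text concurrently.

    Hypothetical helper, not a botok API. Results are returned in the
    same order as `texts`.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(tokenize, texts))


if __name__ == "__main__":
    # Stand-in tokenizer: whitespace splitting instead of a real tokenizer.
    chunks = ["a b c", "d e"]
    print(tokenize_parallel(chunks, str.split))
```

With botok, the callable would presumably be `WordTokenizer().tokenize`, assuming a shared instance is safe to call from several threads, which is not confirmed here. Because of CPython's global interpreter lock, threads may give little speedup for CPU-bound tokenization; passing `ProcessPoolExecutor` as `executor_cls` is an option, but it requires the callable to be picklable, for example a module-level function that builds its own tokenizer.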