
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

Results 31 Botok issues

I suggest that we start using labels in the following way: - there are three concepts: priority labels, context labels, and other labels - priority labels have three classes: issues,...

In the toy example below, I expect to get a tokenized version of the input text. With the code below, the result is a list of tokens, but tokens...

As it stands, `Text(doc).list_word_types` combines tokenization with a statistical operation (basic word frequency). In a typical workflow I might first tokenize and then compute some statistics on the result. Obviously this would...
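The separation described in this issue can be sketched as a standalone counting step that takes an already tokenized text. This is only an illustration: `word_frequencies` is a hypothetical helper, not part of Botok, and the token list is hand-written rather than produced by the tokenizer.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count occurrences of each token string in an already tokenized text.

    Hypothetical helper for illustration; not a Botok function.
    """
    return Counter(tokens)


# Hand-written stand-in for tokenizer output (assumption, not real Botok output).
tokens = ["བཀྲ་ཤིས་", "བདེ་ལེགས་", "བཀྲ་ཤིས་"]
freqs = word_frequencies(tokens)

# Most frequent word types first.
for word, count in freqs.most_common():
    print(word, count)
```

Keeping the counting step independent of tokenization means the same token list could be reused for other statistics without tokenizing again.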

Trying to match int and bool values with CQL produces a NONE error. This seems to happen somewhere in the fsa file. It's an issue since it stops us from matching...

help wanted

Hi, I'm wondering whether it is possible to run sentence tokenization on a list of tokens that has already been word-tokenized (without breaking the original word tokenization)? I tried [the...
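One way to picture what is being asked for is a function that groups existing tokens into sentences without re-splitting them. The sketch below is a deliberately naive assumption: it treats every shad token as a sentence boundary, which is not how Botok's sentence tokenizer works and is not reliable for real Tibetan text. `group_sentences` and the sample tokens are hypothetical.

```python
def group_sentences(tokens, terminators=("།", "༎")):
    """Group pre-tokenized words into sentences, leaving each token untouched.

    Naive assumption: a token equal to a terminator closes the sentence.
    """
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.strip() in terminators:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing terminator
        sentences.append(current)
    return sentences


# Hand-written token list standing in for an existing word tokenization.
tokens = ["བཀྲ་ཤིས་", "བདེ་ལེགས", "།", "ཁྱེད་", "བདེ་མོ་", "ཡིན་", "ནམ", "།"]
sentences = group_sentences(tokens)
```

Flattening `sentences` gives back the original list, so the word tokenization is preserved.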

The sentence_tokenizer() and paragraph_tokenizer() should add sentence attributes to the Token objects directly instead of creating a new list of Tokens embedded in tuples. An idea is to use...
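The general shape of this proposal, annotating tokens in place rather than returning tuples, might look like the following. Everything here is assumed for illustration: `Tok` is a stand-in dataclass (not Botok's Token class), `sentence_id` is a hypothetical attribute name, and the boundary rule (a shad ends a sentence) is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    """Stand-in token type for illustration; not Botok's Token class."""
    text: str
    sentence_id: Optional[int] = None  # hypothetical attribute


def annotate_sentences(tokens, terminator="།"):
    """Write a sentence index onto each token in place and return the same list."""
    sid = 0
    for tok in tokens:
        tok.sentence_id = sid
        if tok.text == terminator:
            sid += 1  # the next token starts a new sentence
    return tokens


toks = [Tok(t) for t in ["ཀ", "།", "ཁ", "།"]]
annotate_sentences(toks)
ids = [t.sentence_id for t in toks]
```

With this layout the token list keeps its original length and order, and sentence membership can be read from each token.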

enhancement

While it seems quite reasonable to cut on naro + shad, there are so many edge cases where the proper cut is difficult to find that it would be helpful...

This PR excludes the test suite from the published package. The tests are not needed for production use.

It seems that the Python installer was accidentally pushed to the repository. This PR removes it.