drupchen
The error comes from the fact that you're feeding sentence_tokenizer() a list of strings, whereas it expects a list of Token objects – which would have attributes such as...
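For readers hitting the same error, here is a minimal sketch of the intended usage. The import path for `sentence_tokenizer` is an assumption and may differ across botok versions:

```python
from botok import WordTokenizer
from botok.tokenizers.sentencetokenizer import sentence_tokenizer

wt = WordTokenizer()

# Wrong: passing plain strings, which lack Token attributes, raises the error.
# sentences = sentence_tokenizer(["བཀྲ་ཤིས་", "བདེ་ལེགས།"])

# Right: tokenize first so sentence_tokenizer() receives Token objects.
tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
sentences = sentence_tokenizer(tokens)
```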
Sorry for taking so long. Here it is at last: The problem would arise if we were to use the CQL matcher against Token attributes containing ints or booleans, such...
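To illustrate the point, here is a hypothetical matcher (not botok's actual CQL implementation) showing why int or boolean Token attributes break string-based CQL comparisons unless they are cast:

```python
class Token:
    """Hypothetical stand-in for a botok Token with non-string attributes."""
    def __init__(self, text, length, skrt):
        self.text = text
        self.len = length   # int attribute
        self.skrt = skrt    # boolean attribute

def cql_matches(token, attr, value):
    # CQL queries carry their values as strings, so a raw comparison
    # against an int or boolean attribute is always False.
    return getattr(token, attr) == value

def cql_matches_cast(token, attr, value):
    # Workaround: normalize the attribute to str before comparing.
    return str(getattr(token, attr)) == value

tok = Token("བཀྲ་", 3, False)
print(cql_matches(tok, "len", "3"))        # False – "3" != 3
print(cql_matches_cast(tok, "len", "3"))   # True
```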
What I can think of right now is first running the preprocessing on the input string ([here](https://github.com/Esukhia/botok/blob/master/botok/tokenizers/wordtokenizer.py#L74)), then distributing the generated chunks to different threads to be tokenized,...
A simple way of doing it would be to change this line: https://github.com/Esukhia/botok/blob/improve-tok/botok/tokenizers/wordtokenizer.py#L80. Creating a new method that returns `tokens`, where all the multiprocessing happens, will keep things simple, and...
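A minimal sketch of that idea, combining the two comments above. The helper names are illustrative, not the actual wordtokenizer.py internals, and it assumes `WordTokenizer.tokenize` is safe to call concurrently, which should be verified:

```python
from concurrent.futures import ThreadPoolExecutor
from botok import WordTokenizer

wt = WordTokenizer()

def tokenize_chunk(chunk):
    # Each worker tokenizes one preprocessed chunk of the input string.
    return wt.tokenize(chunk)

def tokenize_parallel(chunks, workers=4):
    # Distribute the chunks produced by the preprocessing step to a pool
    # of threads, then flatten the per-chunk results into one `tokens` list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(tokenize_chunk, chunks)
    return [token for toks in per_chunk for token in toks]
```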
@mikkokotila, I'm all for improvements! Thanks for the proposal. Have you looked at how easy it is to set up different components for the Text class? Here's how you use...
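As a pointer, here is a minimal sketch of the Text class out of the box; the property name follows botok's documented usage, but check it against your installed version:

```python
from botok import Text

t = Text("བཀྲ་ཤིས་བདེ་ལེགས།")

# The processing steps (preprocessing, tokenizing, modifying, formatting)
# are swappable components; the default pipeline is exposed as properties.
print(t.tokenize_words_raw_text)
```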
I found the commit where the other vowels were taken out: https://github.com/Esukhia/botok/commit/8b47270755732d73bf19a9205907dba67278bf34# I don't remember if it was intentional or not...
Out of the box, here is what pybo's preprocessor does: In the first line of output, `('TEXT', 0, 4)`, `0` stands for the starting index of the chunk, `4` for...
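A small sketch of how such chunk tuples can be read back against the input. The comment above is truncated, so treating the third element as the chunk's length is an assumption here:

```python
# Hypothetical reading of preprocessor chunk tuples such as ('TEXT', 0, 4):
# first element = chunk type, second = starting index in the input string,
# third = (assumed) chunk length.
dummy_input = "abcd efg"
chunks = [("TEXT", 0, 4), ("TEXT", 4, 4)]

for marker, start, length in chunks:
    print(marker, repr(dummy_input[start:start + length]))
```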
The cases of `ག གི གྲ ཀ ཤ པ མ` are not handled yet (the issue is still open). I think the best approach will be to work on a list...
@ngawangtrinley says:

- shad always belong to what is on their left
- all yigo types belong to the text on their right
- separators like drulshad belong to what...
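A minimal sketch of the first two rules (the third is truncated above, so it is left out; the token representation and the shad/yigo test characters are hypothetical, not botok's actual handling):

```python
# Attach punctuation to neighbouring text following the rules above:
# a shad joins the token on its left, a yigo joins the token on its right.
SHAD = "།"   # hypothetical single-character test; real shad variants differ

def attach_punct(tokens):
    out = []
    pending_right = ""  # yigo waiting to join the next token
    for tok in tokens:
        if tok.startswith("༄") or tok.startswith("༅"):
            pending_right += tok   # yigo: belongs to the text on its right
        elif tok == SHAD and out:
            out[-1] += tok         # shad: belongs to what is on its left
        else:
            out.append(pending_right + tok)
            pending_right = ""
    return out

print(attach_punct(["༄༅", "བཀྲ་ཤིས་", "བདེ་ལེགས", "།"]))
# ['༄༅བཀྲ་ཤིས་', 'བདེ་ལེགས།']
```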
OK, then it's perfect. When I have more time, I'll port your implementation to Python.