drupchen
The error comes from the fact that you're feeding sentence_tokenizer() a list of strings, whereas it expects a list of Token objects – which would have attributes such as...
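For readers hitting the same error, here is a minimal sketch of the intended usage. The import path for `sentence_tokenizer` is an assumption and may differ across botok versions:

```python
from botok import WordTokenizer
from botok.tokenizers.sentencetokenizer import sentence_tokenizer

wt = WordTokenizer()

# Wrong: passing plain strings, which lack Token attributes, raises the error.
# sentences = sentence_tokenizer(["བཀྲ་ཤིས་", "བདེ་ལེགས།"])

# Right: tokenize first so sentence_tokenizer() receives Token objects.
tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
sentences = sentence_tokenizer(tokens)
```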
Sorry for taking so long. Here it is at last: The problem would arise if we were to use the CQL matcher against Token attributes containing ints or booleans, such...
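To illustrate the point, here is a hypothetical matcher (not botok's actual CQL implementation) showing why int or boolean Token attributes break string-based CQL comparisons unless they are cast:

```python
class Token:
    """Hypothetical stand-in for a botok Token with non-string attributes."""
    def __init__(self, text, length, skrt):
        self.text = text
        self.len = length   # int attribute
        self.skrt = skrt    # boolean attribute

def cql_matches(token, attr, value):
    # CQL queries carry their values as strings, so a raw comparison
    # against an int or boolean attribute is always False.
    return getattr(token, attr) == value

def cql_matches_cast(token, attr, value):
    # Workaround: normalize the attribute to str before comparing.
    return str(getattr(token, attr)) == value

tok = Token("བཀྲ་", 3, False)
print(cql_matches(tok, "len", "3"))        # False – "3" != 3
print(cql_matches_cast(tok, "len", "3"))   # True
```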
What I can think of right now is first running the preprocessing on the input string ([here](https://github.com/Esukhia/botok/blob/master/botok/tokenizers/wordtokenizer.py#L74)), then distributing the generated chunks to different threads to be tokenized,...
A simple way of doing it would be to change this line: https://github.com/Esukhia/botok/blob/improve-tok/botok/tokenizers/wordtokenizer.py#L80. Creating a new method that returns `tokens`, where all the multiprocessing happens, will keep things simple, and...
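A minimal sketch of that idea, combining the two comments above. The helper names are illustrative, not the actual wordtokenizer.py internals, and it assumes `WordTokenizer.tokenize` is safe to call concurrently, which should be verified:

```python
from concurrent.futures import ThreadPoolExecutor
from botok import WordTokenizer

wt = WordTokenizer()

def tokenize_chunk(chunk):
    # Each worker tokenizes one preprocessed chunk of the input string.
    return wt.tokenize(chunk)

def tokenize_parallel(chunks, workers=4):
    # Distribute the chunks produced by the preprocessing step to a pool
    # of threads, then flatten the per-chunk results into one `tokens` list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(tokenize_chunk, chunks)
    return [token for toks in per_chunk for token in toks]
```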
@mikkokotila, I'm all for improvements! Thanks for the proposal. Have you looked at how easy it is to set up different components for the Text class? Here's how you use...
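As a pointer, here is a minimal sketch of the Text class out of the box; the property name follows botok's documented usage, but check it against your installed version:

```python
from botok import Text

t = Text("བཀྲ་ཤིས་བདེ་ལེགས།")

# The processing steps (preprocessing, tokenizing, modifying, formatting)
# are swappable components; the default pipeline is exposed as properties.
print(t.tokenize_words_raw_text)
```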
I found the commit where the other vowels were taken out: https://github.com/Esukhia/botok/commit/8b47270755732d73bf19a9205907dba67278bf34# I don't remember if it was intentional or not...
Out of the box, here is what pybo's preprocessor does: In the first line of output, `('TEXT', 0, 4)`, `0` stands for the starting index of the chunk, `4` for...
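A small sketch of how such chunk tuples can be read back against the input. The comment above is truncated, so treating the third element as the chunk's length is an assumption here:

```python
# Hypothetical reading of preprocessor chunk tuples such as ('TEXT', 0, 4):
# first element = chunk type, second = starting index in the input string,
# third = (assumed) chunk length.
dummy_input = "abcd efg"
chunks = [("TEXT", 0, 4), ("TEXT", 4, 4)]

for marker, start, length in chunks:
    print(marker, repr(dummy_input[start:start + length]))
```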
The cases of `ག གི གྲ ཀ ཤ པ མ` are not handled yet (the issue is still open). I think the best approach will be to work on a list...
@ngawangtrinley says:

- shad always belong to what is on their left
- all yigo types belong to the text on their right
- separators like drulshad belong to what...
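A minimal sketch of the first two rules (the third is truncated above, so it is left out; the token representation and the shad/yigo test characters are hypothetical, not botok's actual handling):

```python
# Attach punctuation to neighbouring text following the rules above:
# a shad joins the token on its left, a yigo joins the token on its right.
SHAD = "།"   # hypothetical single-character test; real shad variants differ

def attach_punct(tokens):
    out = []
    pending_right = ""  # yigo waiting to join the next token
    for tok in tokens:
        if tok.startswith("༄") or tok.startswith("༅"):
            pending_right += tok   # yigo: belongs to the text on its right
        elif tok == SHAD and out:
            out[-1] += tok         # shad: belongs to what is on its left
        else:
            out.append(pending_right + tok)
            pending_right = ""
    return out

print(attach_punct(["༄༅", "བཀྲ་ཤིས་", "བདེ་ལེགས", "།"]))
# ['༄༅བཀྲ་ཤིས་', 'བདེ་ལེགས།']
```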
OK, then it's perfect. When I have more time, I'll port your implementation to Python.