Statistics performance with tokenizer.list_word_types
As it stands, Text(doc).list_word_types bundles tokenization together with a statistical operation (basic word frequency). In a typical workflow I would first tokenize and then compute several statistics on the result. With bigger docs this becomes quite painful, as the document gets tokenized again for every statistic, effectively doubling (or worse) the time spent.
May I suggest separating statistics into its own class that accepts as input any of the outputs from Text(doc)? That way we could also offer other common operations such as co-occurrence and n-gram statistics.
I would be happy to make a PR for such a class if you think it's a good idea. It would also help keep the Text() namespace clean, where, by the way, you have done a fantastic job; it's rare to see a Python package with this level of namespace clarity.
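For concreteness, here is a rough sketch of what I have in mind (the TokenStats name and everything inside it are purely illustrative, not existing botok API):

```python
from collections import Counter
from typing import List


class TokenStats:
    """Hypothetical statistics class: consumes already-tokenized output,
    so tokenization only ever runs once per document."""

    def __init__(self, tokens: List[str]):
        # tokens: any token sequence derived from Text(doc),
        # e.g. the surface forms coming out of a tokenizer pipe
        self.tokens = list(tokens)

    def word_types(self) -> Counter:
        # basic word frequency, the same information as list_word_types
        return Counter(self.tokens)

    def ngrams(self, n: int = 2) -> Counter:
        # n-gram frequencies, e.g. n=2 for bigrams
        return Counter(zip(*(self.tokens[i:] for i in range(n))))

    def cooccurrence(self, window: int = 5) -> Counter:
        # symmetric co-occurrence counts within a sliding window
        pairs = Counter()
        for i, word in enumerate(self.tokens):
            for other in self.tokens[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((word, other)))] += 1
        return pairs
```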
The proposed way is over 2,000 times faster:
```
current way:               5.92 s ± 55.5 ms per loop
proposed way:              2.27 ms ± 16.7 µs per loop

# bonus for bigrams (and other more complex ops)
proposed way for bigrams:  16.9 ms ± 349 µs per loop
```
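To show what these timings compare, the two workflows look roughly like this, where TokenStats is the illustrative class above and extract_tokens is a hypothetical stand-in for whichever Text(doc) output one picks:

```python
# current way: list_word_types re-tokenizes the whole document on
# every access, so the tokenization cost is paid each time
Text(doc).list_word_types

# proposed way: tokenize once, then every further statistic is a
# cheap counting operation over the cached token list
tokens = extract_tokens(Text(doc))  # hypothetical helper
stats = TokenStats(tokens)
stats.word_types()  # basic word frequency
stats.ngrams(2)     # bigrams
```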
Sounds good to me, but I'll let @drupchen confirm.
@mikkokotila, I'm all for improvements! Thanks for the proposal.
Have you looked at how easy it is to set up different components for the Text class? Here's how to use them: https://github.com/Esukhia/botok/blob/master/tests/test_text.py#L169-L183. Have a look at the function signatures there to see what the building blocks of the pipeline take in and spit out. They are all meant to be independent from one another, so it would be pretty easy to plug in any external code: a class, a function, or whatever.
One good use of that modularity is to create a tokenizer pipe with the tokenizer instantiated outside of Text, so it does not need to be reloaded into memory every time tokenization runs in a Text object.
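Something along these lines, as a sketch: the WordTokenizer usage and the custom_pipeline call follow the linked tests, but treat the exact signatures and the "dummy" pass-through pipes as assumptions that may vary between botok versions:

```python
from botok import Text, WordTokenizer

# instantiate the (expensive-to-load) tokenizer once, outside of Text,
# so the trie is loaded into memory a single time
wt = WordTokenizer()  # constructor arguments vary across versions


def word_tok(text):
    # a tokenizer pipe: a plain callable wrapping the shared instance,
    # pluggable into Text's pipeline like any other component
    return wt.tokenize(text)


documents = ["..."]  # any iterable of input strings
for doc in documents:
    t = Text(doc)
    # plug the external pipe in, as in the linked tests; the "dummy"
    # pass-through pipes and the argument order are assumptions
    tokens = t.custom_pipeline("dummy", word_tok, "dummy", "dummy")
```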
I just wanted to make sure you understood what I had in mind with this pipeline. Thanks for your help!