python-bpe

use tok as fallback tokenizer

Open kootenpv opened this issue 4 years ago • 3 comments

The tok tokenizer makes for a better default (faster and saner).

Targets #10

kootenpv avatar Jul 09 '19 18:07 kootenpv

For context, in NLP tasks it can be important to retain contractions as written, since their usage can be informative about other aspects of the author of the text.

soaxelbrooke avatar Jul 10 '19 20:07 soaxelbrooke

@soaxelbrooke I personally think that losing the contraction information is more than compensated by the increased power of generalization you get from having it all normalized :)!

I.e., without normalization, "not" and "'" / "t" are not generalized to the same token, and "did" and "didn" don't generalize either. I deem this to be worse!

I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also having the nltk dependency?

kootenpv avatar Jul 11 '19 07:07 kootenpv
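To illustrate the generalization point in the comment above, here is a minimal sketch. Both functions are hypothetical stand-ins written for this example; neither is the NLTK nor the tok implementation, and the contraction table is a tiny illustrative subset.

```python
import re

def punct_split(text):
    """Hypothetical stand-in: split into runs of word characters or punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text.lower())

# Tiny illustrative table, not a complete list of English contractions.
CONTRACTIONS = {"didn't": "did not", "don't": "do not", "isn't": "is not"}

def normalized_split(text):
    """Hypothetical stand-in: expand known contractions, then split as above."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return punct_split(text)

print(punct_split("I didn't go"))        # ['i', 'didn', "'", 't', 'go']
print(normalized_split("I didn't go"))   # ['i', 'did', 'not', 'go']
print(normalized_split("I did not go"))  # ['i', 'did', 'not', 'go']
```

With the plain split, "didn't" and "did not" share no tokens. After normalization they produce the same sequence, at the cost of no longer recording which form the author used.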

@kootenpv I'm fine with people having different preferences on how they'd like those split up. My biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose tok as an alternative word tokenizer in the Encoder constructor.

soaxelbrooke avatar Jul 15 '19 16:07 soaxelbrooke
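A minimal sketch of what such an opt-in constructor argument could look like, assuming the tokenizer is passed as a plain callable. `SketchEncoder`, `default_word_tokenizer`, and the `word_tokenizer` parameter name are hypothetical and written for this example; this is not the library's actual Encoder code, and the regex default merely stands in for the existing NLTK-based one.

```python
import re
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]

def default_word_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in for the existing NLTK-based default."""
    return re.findall(r"\w+|[^\w\s]+", text)

class SketchEncoder:
    """Illustrative only: shows a tokenizer option that leaves the default unchanged."""

    def __init__(self, vocab_size: int = 8192,
                 word_tokenizer: Optional[Tokenizer] = None) -> None:
        self.vocab_size = vocab_size
        # Callers that pass nothing keep the old behavior, so nothing breaks.
        self.word_tokenizer = word_tokenizer or default_word_tokenizer

    def tokenize_words(self, text: str) -> List[str]:
        return self.word_tokenizer(text)

# Existing callers: default behavior is preserved.
print(SketchEncoder().tokenize_words("I didn't go"))
# Opt-in callers: any callable from str to a list of str works,
# e.g. a tok-based function if that package is installed.
print(SketchEncoder(word_tokenizer=str.split).tokenize_words("I didn't go"))
```

Because the new argument defaults to the current tokenizer, dependent code keeps working, and an alternative tokenizer only needs to be imported by users who ask for it.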