python-bpe use tok as fallback tokenizer

Open kootenpv opened this issue 4 years ago • 3 comments

The tok tokenizer makes for a better default (faster and saner).

Targets #10

Jul 09 '19 18:07 kootenpv

For context, in NLP tasks it can be important to retain the usage of contractions, since it can be informative about other aspects of the author of the text.

Jul 10 '19 20:07 soaxelbrooke

@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than compensated by the increased power of generalization by having it all normalized :)!

I.e., with the current splitting, "not" and "'" / "t" are never mapped to the same token, and neither are "did" and "didn". I deem this to be worse!

I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also having the nltk dependency?

Jul 11 '19 07:07 kootenpv
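
To make the trade-off above concrete, here is a minimal sketch of the two behaviours being compared. Both functions are illustrative stand-ins: `wordpunct_style` uses a plain word/punctuation regex, and `normalizing_style` uses a tiny hypothetical contraction table. Neither is the actual code of NLTK, tok, or python-bpe.

```python
import re

def wordpunct_style(text):
    # Split into runs of word characters and runs of punctuation.
    # Contractions are kept as written, but broken into pieces.
    return re.findall(r"\w+|[^\w\s]+", text)

# Hypothetical, deliberately tiny contraction table (an assumption for
# illustration, not taken from any library).
CONTRACTIONS = {"didn't": "did not", "don't": "do not", "isn't": "is not"}

def normalizing_style(text):
    # Lowercase, split on whitespace, and expand known contractions so that
    # "didn't" and "did not" end up as the same tokens.
    words = []
    for raw in text.lower().split():
        words.extend(CONTRACTIONS.get(raw, raw).split())
    return words

print(wordpunct_style("I didn't go"))    # ['I', 'didn', "'", 't', 'go']
print(normalizing_style("I didn't go"))  # ['i', 'did', 'not', 'go']
```

The first output keeps the information that a contraction was used, at the cost of tokens such as "didn" that share nothing with "did". The second loses that information but maps both spellings to the same tokens.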

@kootenpv I'm fine with people having different preferences on how they'd like those split up. My biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose tok as an alternative word tokenizer in the Encoder constructor.

Jul 15 '19 16:07 soaxelbrooke
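
One way the non-breaking option suggested above might look is sketched below. `EncoderSketch`, `default_word_tokenizer`, and `whitespace_tokenizer` are hypothetical names invented for this illustration; the real `Encoder` constructor, the NLTK default, and tok's API may all differ. The idea is only that the tokenizer is an optional callable, so existing callers keep the current behaviour.

```python
import re
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]

def default_word_tokenizer(text: str) -> List[str]:
    # Stand-in for the existing default (assumed word/punctuation splitting).
    return re.findall(r"\w+|[^\w\s]+", text)

def whitespace_tokenizer(text: str) -> List[str]:
    # Stand-in for an alternative tokenizer such as tok (assumption).
    return text.split()

class EncoderSketch:
    """Hypothetical stand-in, not python-bpe's actual Encoder."""

    def __init__(self, word_tokenizer: Optional[Tokenizer] = None):
        # Fall back to the old default when no tokenizer is passed, so
        # code written against the previous interface is unaffected.
        if word_tokenizer is None:
            word_tokenizer = default_word_tokenizer
        self.word_tokenizer = word_tokenizer

    def tokenize_words(self, text: str) -> List[str]:
        return self.word_tokenizer(text)

legacy = EncoderSketch()                                     # unchanged behaviour
custom = EncoderSketch(word_tokenizer=whitespace_tokenizer)  # opt-in alternative

print(legacy.tokenize_words("didn't"))    # ['didn', "'", 't']
print(custom.tokenize_words("I didn't"))  # ['I', "didn't"]
```

Because the alternative is opt-in, it would not need to become a hard dependency: a caller who wants it imports it and passes it in.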
