python-bpe
use tok as fallback tokenizer
The `tok` tokenizer makes for a better default (faster and saner).
Targets #10
For context, in NLP tasks it can be important to preserve contractions, since their usage can be informative about the author of the text.
@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than compensated for by the increased generalization you get from normalizing it all :)!
i.e. `not` and `'t` are not generalized, and `did` and `didn` don't generalize. I deem this to be worse!
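To make the two behaviors concrete, here is a minimal sketch contrasting the kind of contraction splitting NLTK's tokenizer does ("didn't" becomes `did` + `n't`) with splitting at the apostrophe ("didn't" becomes `didn` + `'t`). These are illustrative regex approximations, not the actual NLTK or `tok` implementations:

```python
import re

def split_contraction_suffix(text):
    # Approximates NLTK-style treatment of negated contractions:
    # "didn't" -> ["did", "n't"]. Illustrative only.
    tokens = []
    for word in text.split():
        m = re.match(r"(.+)(n't)$", word)
        if m:
            tokens.extend(m.groups())
        else:
            tokens.append(word)
    return tokens

def split_at_apostrophe(text):
    # Splits at the apostrophe instead: "didn't" -> ["didn", "'t"].
    tokens = []
    for word in text.split():
        if "'" in word:
            head, _, tail = word.partition("'")
            tokens.extend([head, "'" + tail])
        else:
            tokens.append(word)
    return tokens

print(split_contraction_suffix("he didn't go"))  # ['he', 'did', "n't", 'go']
print(split_at_apostrophe("he didn't go"))       # ['he', 'didn', "'t", 'go']
```

Under apostrophe splitting, `didn` is a different token from `did`, and `'t` never matches `not`, which is the generalization loss described above.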
I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also pulling in the `nltk` dependency?
@kootenpv I'm fine with people having different preferences on how they'd like those split up; my biggest concern here is changing the interface, which is a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose `tok` as an alternative word tokenizer in the `Encoder` constructor.