Add character and word level tokenizers
Add simple character and word level tokenizers that conform to the LearnBase.jl getobs/nobs interfaces.
Ref: https://github.com/JuliaText/WordTokenizers.jl
Just a few questions. By this conformity, do you mean adding getobs/nobs methods for a tokenizer type? Since the tokenizer, as the reference describes, essentially splits text by spaces or into individual characters, would this be used in the context of preprocessing text-based datasets? Would the new module and tests go into /src/methods and /test/methods respectively?
Also, how do I self-assign this issue?
No need to self-assign the issue. Just submit a PR when ready.
By conform, I mean defining new getobs/nobs methods on the tokenizer type that call the underlying splitting methods. The functions should go in src/datasets/transformations.jl.
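Something along these lines might work — a minimal sketch, assuming hypothetical CharTokenizer and WordTokenizer types that pre-split the text eagerly (the type names, fields, and whether to split eagerly or lazily are all open design choices, not the final API):

```julia
using LearnBase: LearnBase, getobs, nobs

# Hypothetical tokenizer types; names and fields are placeholders.
# Each stores the pre-split tokens so getobs is a simple index lookup.
struct CharTokenizer
    tokens::Vector{Char}
end
CharTokenizer(text::AbstractString) = CharTokenizer(collect(text))

struct WordTokenizer
    tokens::Vector{SubString{String}}
end
WordTokenizer(text::AbstractString) = WordTokenizer(split(text))

# Conform to the LearnBase data access interface: nobs reports the
# number of tokens, getobs retrieves the token(s) at the given indices.
LearnBase.nobs(t::Union{CharTokenizer,WordTokenizer}) = length(t.tokens)
LearnBase.getobs(t::Union{CharTokenizer,WordTokenizer}, idx) = t.tokens[idx]

# Usage:
t = WordTokenizer("the quick brown fox")
nobs(t)        # 4
getobs(t, 2)   # "quick"
getobs(t, 1:2) # ["the", "quick"]
```

Splitting once at construction keeps getobs cheap; a lazier design could instead hold the raw string and split on each access, at the cost of repeated work.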