FastAI.jl
Add character and word level tokenizers
Add simple character and word level tokenizers that conform to the LearnBase.jl getobs/nobs interfaces.
Ref: https://github.com/JuliaText/WordTokenizers.jl
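For reference, the underlying splitting is straightforward. A minimal sketch of the two levels, assuming WordTokenizers.jl's exported `tokenize` function for the word level (the exact token boundaries depend on its default tokenizer):

```julia
using WordTokenizers  # word-level splitting, handles punctuation etc.

# Word level: tokenize returns a vector of token strings.
words = tokenize("Add simple tokenizers to FastAI.jl")

# Character level: plain Julia suffices, no package needed.
chars = collect("FastAI")  # ['F', 'a', 's', 't', 'A', 'I']
```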
Just a few questions. By this conformity, do you mean defining new methods of the getobs/nobs functions for a tokenizer type? Since the tokenizer, as the reference describes, essentially splits text by spaces or into individual characters, would this be used for preprocessing text-based datasets? And would the new module and tests go into /src/methods and /test/methods respectively?
Also, how do I self-assign this issue?
No need to self-assign the issue. Just submit a PR when ready.
By conform, I mean defining new getobs/nobs methods on the tokenizer type that call the underlying splitting methods. The functions should go in src/datasets/transformations.jl.
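To make that concrete, here is a minimal sketch of what such a conforming type could look like. The type names (`CharTokenizer`, `WordTokenizer`) are hypothetical, not existing FastAI.jl API; it assumes LearnBase.jl exports `getobs` and `nobs`:

```julia
using LearnBase

# Hypothetical character-level tokenizer: each observation is one Char.
struct CharTokenizer
    chars::Vector{Char}
end
CharTokenizer(s::AbstractString) = CharTokenizer(collect(s))

# Hypothetical word-level tokenizer: each observation is one token.
struct WordTokenizer
    words::Vector{String}
end
WordTokenizer(s::AbstractString) = WordTokenizer(String.(split(s)))

# Conform to the LearnBase data-access interface by extending
# nobs (number of observations) and getobs (fetch the i-th one).
LearnBase.nobs(t::CharTokenizer) = length(t.chars)
LearnBase.getobs(t::CharTokenizer, i::Int) = t.chars[i]

LearnBase.nobs(t::WordTokenizer) = length(t.words)
LearnBase.getobs(t::WordTokenizer, i::Int) = t.words[i]
```

With this in place, any data-loading code written against getobs/nobs can iterate tokens the same way it iterates any other dataset, e.g. `getobs(WordTokenizer("a b c"), 2)` yields `"b"`.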