FastAI.jl Add character and word level tokenizers

Add simple character and word level tokenizers that conform to the LearnBase.jl getobs/nobs interfaces.

Apr 02 '21 16:04 darsnack

Ref: https://github.com/JuliaText/WordTokenizers.jl

Apr 05 '21 09:04 AriMKatz

Just a few questions. By this conformity, do you mean adding new methods of the getobs/nobs functions that handle a tokenizer type? Since the tokenizer, as the reference describes it, essentially splits text on spaces or into individual characters, would this be used for preprocessing text-based datasets? Would the new module and tests go into /src/methods and /test/methods respectively?

Also, how do I self-assign this issue?

Apr 06 '21 03:04 samuelzxu

No need to self-assign the issue. Just submit a PR when ready.

By conform, I mean defining new getobs/nobs methods on the tokenizer type that call the underlying splitting methods. The functions should go in src/datasets/transformations.jl.

Apr 06 '21 18:04 darsnack
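
For anyone picking this up, here is a rough sketch of what "defining getobs/nobs on the tokenizer type" could look like. It is only an illustration under some assumptions: the type names `CharTokenizer` and `WordTokenizer` are hypothetical, the wrapped data is assumed to be a vector of strings, and `getobs`/`nobs` are declared locally as stand-ins so the snippet runs without any packages. In an actual PR the methods would presumably extend `LearnBase.getobs` and `LearnBase.nobs`, and the word splitting might delegate to WordTokenizers.jl rather than `Base.split`.

```julia
# Local stand-ins for the LearnBase.jl interface functions (assumption: a real
# implementation would add methods to LearnBase.getobs / LearnBase.nobs instead).
function getobs end
function nobs end

# Hypothetical wrapper types: each holds a collection of documents (strings).
struct CharTokenizer{T<:AbstractVector{<:AbstractString}}
    data::T
end

struct WordTokenizer{T<:AbstractVector{<:AbstractString}}
    data::T
end

const AnyTokenizer = Union{CharTokenizer,WordTokenizer}

# One observation per document.
nobs(t::AnyTokenizer) = length(t.data)

# Character level: a document becomes a vector of its characters.
getobs(t::CharTokenizer, i::Integer) = collect(t.data[i])

# Word level: a document is split on whitespace (Base.split with no delimiter).
getobs(t::WordTokenizer, i::Integer) = split(t.data[i])

# Multiple indices: tokenize each requested document.
getobs(t::AnyTokenizer, idxs::AbstractVector{<:Integer}) = [getobs(t, i) for i in idxs]

# Usage
docs = ["hello world", "fast ai"]
words = WordTokenizer(docs)
chars = CharTokenizer(docs)

println(nobs(words))        # number of documents
println(getobs(words, 1))   # word tokens of the first document
println(getobs(chars, 2))   # character tokens of the second document
```

Tokenizing lazily inside `getobs` means the raw text is only split when an observation is requested, which keeps the wrapper cheap to construct and composable with other data containers.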