FastAI.jl

Add character and word level tokenizers

Open darsnack opened this issue 3 years ago • 3 comments

Add simple character and word level tokenizers that conform to the LearnBase.jl getobs/nobs interfaces.

darsnack avatar Apr 02 '21 16:04 darsnack

Ref: https://github.com/JuliaText/WordTokenizers.jl

AriMKatz avatar Apr 05 '21 09:04 AriMKatz

Just a few questions. By conformity, do you mean adding methods of the getobs and nobs functions for a tokenizer type? Since the tokenizer described in the reference essentially splits text on spaces or into individual characters, would this be used for preprocessing text-based datasets? Would the new module and tests go into /src/methods and /test/methods respectively?

Also, how do I self-assign this issue?

samuelzxu avatar Apr 06 '21 03:04 samuelzxu

No need to self-assign the issue. Just submit a PR when ready.

By conform, I mean defining new getobs/nobs methods on the tokenizer type that call the underlying splitting methods. The functions should go in src/datasets/transformations.jl.

darsnack avatar Apr 06 '21 18:04 darsnack
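
For illustration, here is a minimal sketch of what the last comment describes: a tokenizer type that wraps a collection of strings and gets its own getobs/nobs methods. The type names `CharTokenizer` and `WordTokenizer` are hypothetical and do not come from FastAI.jl or WordTokenizers.jl. To keep the sketch runnable with only Base, `getobs` and `nobs` are declared locally as stand-ins; in the package one would presumably extend the LearnBase.jl functions instead (`import LearnBase: getobs, nobs`). Word splitting here uses Base's `split` on whitespace, an assumption; a tokenizer from WordTokenizers.jl could be swapped in.

```julia
# Stand-ins so the sketch runs with only Base (assumption). In a package,
# extend the real interface instead: `import LearnBase: getobs, nobs`.
function getobs end
function nobs end

# Hypothetical wrapper types; each holds the raw documents to tokenize lazily.
struct CharTokenizer{T<:AbstractVector{<:AbstractString}}
    data::T
end

struct WordTokenizer{T<:AbstractVector{<:AbstractString}}
    data::T
end

# One observation per underlying document.
nobs(t::CharTokenizer) = length(t.data)
nobs(t::WordTokenizer) = length(t.data)

# Character level: observation `i` is the vector of characters of document `i`.
getobs(t::CharTokenizer, i::Integer) = collect(t.data[i])

# Word level: observation `i` is document `i` split on whitespace.
getobs(t::WordTokenizer, i::Integer) = String.(split(t.data[i]))

# Several indices at once: return one tokenized observation per index.
getobs(t::Union{CharTokenizer,WordTokenizer}, idxs::AbstractVector{<:Integer}) =
    [getobs(t, i) for i in idxs]

docs = ["hello world", "FastAI in Julia"]
words = WordTokenizer(docs)
chars = CharTokenizer(docs)

first_words = getobs(words, 1)      # tokens of the first document
first_chars = getobs(chars, 1)      # characters of the first document
batch = getobs(words, [1, 2])       # tokens of both documents
```

Because tokenization happens inside `getobs`, nothing is split until an observation is requested, which fits the idea of placing these methods alongside the other lazy dataset transformations in src/datasets/transformations.jl.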