
`tokenize` API

Open MikeInnes opened this issue 5 years ago • 9 comments

The set_tokenizer API seems a bit suspect here, given that it can be replaced with

const tokenize = WordTokenizers.nltk_tokenize

and likewise for RevTok etc, without bringing in multiple packages just to define an alias :)

I also think it's generally a good idea to expose people to higher-order functions and such; people might not realise that you can, e.g., just pass a custom tokenize function into a constructor rather than setting and unsetting it globally.
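
For illustration, a minimal sketch of that higher-order style; the BagOfWords type is hypothetical and not part of WordTokenizers.jl, only WordTokenizers.tokenize is assumed to exist:

using WordTokenizers

# Hypothetical container that stores whichever tokenizer function it is
# given, instead of relying on a package-wide global set via set_tokenizer.
struct BagOfWords{F}
    tokenize::F
end

(m::BagOfWords)(str::AbstractString) = unique(m.tokenize(str))

bow = BagOfWords(WordTokenizers.tokenize)  # swap in nltk_tokenize, RevTok, etc.
bow("Tokenizers are just functions, so pass them around.")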

MikeInnes avatar Oct 12 '18 08:10 MikeInnes

perhaps. My original thought was that it would be good if the average user didn't have to worry about which tokenizers were available, and could just say "tokenize this".

But also the more advanced user might want to configure it, and have it apply globally -- even into other packages.

However, the big issue I see with that is that the tokenizer is really corpus specific, so the idea of a settable global default is perhaps silly. If we think about handling different languages, or even texts from Twitter vs. newspaper articles, you want a different tokenizer. So making this settable globally might not be a good idea.

Related: we actually should be thinking in terms of languages, like Embeddings.jl does.

I am thinking more like:

const tokenize = tokenizer(English()) # use the default English tokenizer
const tokenize = tokenizer(English(), 2) # use the second English tokenizer

and we should expose list_tokenizers(::Language), which gives a list of suitable tokenizers. (E.g. TokTok #5 is good for a bunch of languages, whereas Penn is only good for English.) More generally, we can maybe attach traits to the tokenizer functions: traits for language, traits for reversibility. That might be better; then one can say:

const tokenize = tokenizer(English(), Reversible(), URLsSupported())

(This should be using Languages.jl for type-based language IDs, here and there; cf. https://github.com/JuliaText/Embeddings.jl/issues/6)
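
A rough sketch of that lookup (everything below is hypothetical: list_tokenizers and tokenizer are not implemented, and the imported names are assumptions to check against the current Languages.jl and WordTokenizers.jl exports):

using Languages: Language, English
using WordTokenizers: nltk_word_tokenize, toktok_tokenize

# Hypothetical registry of suitable tokenizers per language, best first.
list_tokenizers(::English) = [toktok_tokenize, nltk_word_tokenize]

# Hypothetical lookup: the default is the first entry, or pick the i-th.
tokenizer(lang::Language, i::Integer = 1) = list_tokenizers(lang)[i]

const tokenize  = tokenizer(English())     # default English tokenizer
const tokenize2 = tokenizer(English(), 2)  # second English tokenizer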

oxinabox avatar Oct 16 '18 03:10 oxinabox

@oxinabox Would it be a good idea to add a traits function which takes any tokenizer as input and gives info about it, which could potentially be used in the tokenizer approach proposed above?

aquatiko avatar Apr 03 '19 13:04 aquatiko

I think multiple different trait functions, starting with language.
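
For example, the language trait could just be an ordinary function over tokenizer functions (a hypothetical sketch; none of these trait functions exist in the package, and the imported names are assumptions):

using Languages: Language, English, German
using WordTokenizers: nltk_word_tokenize, toktok_tokenize

# Hypothetical language trait: which languages a given tokenizer supports.
supported_languages(::typeof(nltk_word_tokenize)) = (English(),)
supported_languages(::typeof(toktok_tokenize))    = (English(), German())

supports(tok, lang::Language) = lang in supported_languages(tok)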

oxinabox avatar Apr 03 '19 14:04 oxinabox

I really just think docstrings would be better here. It's a case of KISS until there's a clear need for any more complexity.

MikeInnes avatar Apr 03 '19 15:04 MikeInnes

Yeah. Still, the nice thing to do would be to have a default per language.

Then the same for Embeddings.jl (which almost does this); see https://github.com/JuliaText/Embeddings.jl/issues/6

Then we could do things like:

# sketch of a future pipeline: detect the language, then pick a matching
# tokenizer and embedding table
LANG = Languages.detect_language(corpus)
tokenizer = Tokenizers.tokenizer(LANG)
words = tokenizer(corpus)
vocab = unique(words)
embtable = Embeddings.load_embeddings(LANG, vocab)

onehot = onehot_encoder(length(vocab))
mdl = model(embtable.embeddings)
train!(mdl, onehot.(words))

oxinabox avatar Apr 03 '19 17:04 oxinabox

That's a good use case, although even then, wasn't #14 meant to implement something fairly general and language-agnostic? It seems better to have the same default for all languages if at all possible.

MikeInnes avatar Apr 03 '19 17:04 MikeInnes

#18 is fairly general and language agnostic, and is now the default. But it is still basically useless in a ton of languages; it is still space-centric. Further, we don't have any tokenizer for any language yet (including English) that is better than that.

So until we do, this is not really pressing, as the answer would always be "use TokTok".

oxinabox avatar Apr 03 '19 18:04 oxinabox

I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

Ayushk4 avatar Aug 21 '19 15:08 Ayushk4

> I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

I'm not sure how that would work. They act at different levels. The TokenBuffer API makes tokenizers.

The Tokenizer API specifies what should happen when you call tokenizer(str) or split(str, (Words(), Sentences())) (IIRC).
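
For reference, a custom tokenizer built on the TokenBuffer lexers looks roughly like this (following the pattern in the package docs; treat the exact imports as an assumption):

using WordTokenizers: TokenBuffer, isdone, spaces, character

# A whitespace tokenizer assembled from TokenBuffer lexers: `spaces`
# consumes (and drops) whitespace, `character` appends the next character
# to the token currently being built.
function whitespace_tokenize(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) || character(ts)
    end
    return ts.tokens
end

whitespace_tokenize("build your own tokenizer")  # ["build", "your", "own", "tokenizer"]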

oxinabox avatar Aug 21 '19 15:08 oxinabox