
A simple tokenizer in Ruby for NLP tasks.

5 tokenizer issues

The initialization options are ignored. The following splitters could be incorporated: ‘, ’, ”, `
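A possible stopgap until those characters are handled natively: normalize the typographic quotes to their ASCII counterparts before tokenizing, assuming the current splitter set already covers the ASCII quote characters. The mapping and helper below are an illustration only, not part of the gem.

```ruby
require 'tokenizer' # gem name assumed to be 'tokenizer'

# Map typographic quotes and the backtick onto ASCII equivalents so the
# existing ASCII splitters can pick them up. This loses the original glyphs,
# so it is only a workaround, not a fix for the ignored initialization options.
CURLY_TO_ASCII = { '‘' => "'", '’' => "'", '”' => '"', '`' => "'" }.freeze

def tokenize_with_curly(text)
  normalized = text.gsub(/[‘’”`]/, CURLY_TO_ASCII)
  Tokenizer::WhitespaceTokenizer.new.tokenize(normalized)
end
```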

Fix #6: Created a new splittable `PRE_N_POST_ONLY` which holds characters that can be both prefixes and suffixes but are only treated as splittables when they appear at the beginning or end of a token...
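Roughly, the idea could look like the standalone sketch below (not the gem's implementation; the member characters are assumed for illustration): such characters are detached only when they sit at the very edge of a whitespace-delimited token, never in the middle.

```ruby
# Characters assumed to belong to PRE_N_POST_ONLY, for illustration only.
PRE_N_POST_ONLY = %w[' " -].freeze
EDGE = /[#{Regexp.escape(PRE_N_POST_ONLY.join)}]/

def tokenize_edge_only(text)
  text.split(/\s+/).flat_map do |token|
    # Peel PRE_N_POST_ONLY characters off the front and back of the token;
    # anything in the middle (e.g. the apostrophe in "don't") stays put.
    m = token.match(/\A(#{EDGE}*)(.*?)(#{EDGE}*)\z/)
    [*m[1].chars, m[2], *m[3].chars].reject(&:empty?)
  end
end

p tokenize_edge_only(%q('quoted' text and don't stop))
# => ["'", "quoted", "'", "text", "and", "don't", "stop"]
```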

`Tokenizer::WhitespaceTokenizer.new.tokenize "et souligne l'interrelation étroite de l'imagerie avec le comportement"` => `["et", "souligne", "l", "'", "i", "n", "t", "e", "r", "r", "e", "l", "a", "t", "i", "o", "n", "étroite", "de",...`
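The word containing the apostrophe gets exploded into single characters. A rough sketch of the behaviour the report seems to expect (a hypothetical helper, not the gem's code): split on whitespace, then detach only the apostrophe itself.

```ruby
# Hypothetical sketch of the expected output: keep words intact and split
# only around the apostrophe, instead of exploding the rest of the word.
def tokenize_clitics(text)
  text.split(/\s+/).flat_map do |token|
    # "l'interrelation" -> ["l", "'", "interrelation"]
    token.split(/(')/).reject(&:empty?)
  end
end

p tokenize_clitics("et souligne l'interrelation étroite de l'imagerie")
# => ["et", "souligne", "l", "'", "interrelation", "étroite", "de", "l", "'", "imagerie"]
```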

`Tokenizer::WhitespaceTokenizer.new.tokenize "www.google.com"` => `["www", ".", "g", "o", "o", "g", "l", "e", ".", "c", "o", "m"]` I want website URLs to be tokenized effectively as a single noun, so I would expect www.google.com to be tokenized as "www.google.com". I am...
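One way to approximate that behaviour today might be to shield URL-like tokens from the splitter, along the lines of the sketch below. The `URL_LIKE` pattern and the `tokenize_keeping_urls` helper are assumptions for illustration, not part of the gem, and the pattern is deliberately rough.

```ruby
require 'tokenizer' # gem name assumed to be 'tokenizer'

# Very rough heuristic for "looks like a URL or bare domain"; tighten as needed.
URL_LIKE = %r{\A(?:https?://)?[\w-]+(?:\.[\w-]+)+\S*\z}i

def tokenize_keeping_urls(text)
  splitter = Tokenizer::WhitespaceTokenizer.new
  text.split(/\s+/).flat_map do |chunk|
    # Pass URL-like chunks through untouched; tokenize everything else as usual.
    chunk.match?(URL_LIKE) ? [chunk] : splitter.tokenize(chunk)
  end
end

p tokenize_keeping_urls("see www.google.com for details")
# => ["see", "www.google.com", "for", "details"]
```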