Simple Tokenizer not separating punctuation correctly
It seems that punctuation symbols are not correctly separated.
```python
>>> simple_tokenizer('"test": "that"')
['"', 'test', '":', '"', 'that', '"']
```

I would have expected: `['"', 'test', '"', ':', '"', 'that', '"']`
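For reference, the expected behavior could be sketched with a simple regex-based tokenizer (a hypothetical alternative, not the project's actual implementation), which emits each punctuation mark as its own token:

```python
import re

def split_punct_tokenizer(text):
    # \w+ matches word runs; [^\w\s] matches a single punctuation mark,
    # so adjacent punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(split_punct_tokenizer('"test": "that"'))
# → ['"', 'test', '"', ':', '"', 'that', '"']
```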
P.S.: I would rather you review my big refactor before applying any other changes, to avoid constant conflicts in the PR 🙂
Thanks for the feedback!
The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and to group punctuation signs, because they are not relevant in this case.
Maybe the name could be changed (word tokenizer?); in any case, this behavior should at least be documented.
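The clustering behavior described above can be sketched with a one-character change to the regex from before (again a simplified model, assuming the actual implementation works similarly): adding `+` to the punctuation class makes adjacent punctuation marks collapse into a single token, which reproduces the reported output.

```python
import re

def grouping_tokenizer(text):
    # [^\w\s]+ (note the +) clusters a run of adjacent punctuation
    # marks into one token instead of splitting them apart.
    return re.findall(r"\w+|[^\w\s]+", text)

print(grouping_tokenizer('"test": "that"'))
# → ['"', 'test', '":', '"', 'that', '"']
```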
Yes, your PR has the priority now!
Hi @adbar,
Coming back to this. Is there any reason to group symbols? Performance, or something else? Or was it simply because it's easier and does the job, since symbols are ignored anyway?
Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and then run the lemmatizer on it, which is useless in the current context and only adds processing time.
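To illustrate the trade-off: with grouped punctuation, a single cheap check per token is enough to skip entire punctuation clusters before lemmatization. This is a hypothetical sketch (the `grouping_tokenizer` and `lemmatize` names are assumptions for illustration, not the project's API):

```python
import re

def grouping_tokenizer(text):
    # cluster adjacent punctuation into one token (sketch)
    return re.findall(r"\w+|[^\w\s]+", text)

def lemmatize_words(text, lemmatize):
    # One isalnum() check skips a whole punctuation cluster,
    # so the lemmatizer only ever sees word tokens.
    return [lemmatize(t) for t in grouping_tokenizer(text) if t[0].isalnum()]

print(lemmatize_words('"test": "that"', str.lower))
# → ['test', 'that']
```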