
Simple Tokenizer not separating punctuation correctly

Open juanjoDiaz opened this issue 3 years ago • 4 comments

It seems that punctuation symbols are not correctly separated.

>>> simple_tokenizer('"test": "that"')
['"', 'test', '":', '"', 'that', '"']

I would have expected: ['"', 'test', '"', ':', '"', 'that', '"']
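For reference, the expected behavior (each punctuation character as its own token) can be sketched with a simple regex. This is a hypothetical alternative for illustration, not simplemma's actual implementation:

```python
import re

# Hypothetical tokenizer: a token is either a run of word characters
# or a single punctuation character (whitespace is skipped).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def split_punct_tokenizer(text):
    return TOKEN_RE.findall(text)

print(split_punct_tokenizer('"test": "that"'))
# → ['"', 'test', '"', ':', '"', 'that', '"']
```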

P.S.: I would rather you review my big refactor before applying any other changes, to avoid constant conflicts in the PR 🙂

juanjoDiaz avatar Jan 19 '23 13:01 juanjoDiaz

Thanks for the feedback!

The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and group punctuation signs, because they're not relevant in this case.

Maybe the name could be changed (word tokenizer?); in any case, this behavior should be documented.
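The clustering behavior described above can be approximated like this (a sketch for illustration, not simplemma's actual regex): runs of word characters and runs of punctuation each become a single token, which reproduces the output reported in the issue.

```python
import re

# Approximation of the clustering tokenizer: consecutive punctuation
# characters are grouped into one token instead of being split apart.
CLUSTER_RE = re.compile(r"\w+|[^\w\s]+")

def cluster_tokenizer(text):
    return CLUSTER_RE.findall(text)

print(cluster_tokenizer('"test": "that"'))
# → ['"', 'test', '":', '"', 'that', '"']
```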

adbar avatar Jan 19 '23 17:01 adbar

Yes, your PR has the priority now!

adbar avatar Jan 19 '23 17:01 adbar

Hi @adbar,

Coming back to this. Is there any reason to group symbols? Performance or something else? Or was it just because it's simpler and does the job, since symbols are ignored anyway?

juanjoDiaz avatar May 12 '23 17:05 juanjoDiaz

Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and run the lemmatizer on it (which is useless in the current context and only means additional processing time).
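The trade-off can be illustrated with the two hypothetical regexes from above (again, not simplemma's actual implementation): splitting punctuation character by character produces more tokens, and each extra token would be another useless pass through the lemmatizer.

```python
import re

CLUSTER_RE = re.compile(r"\w+|[^\w\s]+")  # groups punctuation runs
SPLIT_RE = re.compile(r"\w+|[^\w\s]")     # one token per punctuation char

text = '"test": "that", "and": "more"' * 1000

# Splitting yields strictly more tokens on punctuation-heavy input,
# i.e. more downstream lemmatizer calls for no gain.
print(len(CLUSTER_RE.findall(text)))
print(len(SPLIT_RE.findall(text)))
```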

adbar avatar May 12 '23 19:05 adbar