
Simple Tokenizer not separating punctuation correctly

Open juanjoDiaz opened this issue 3 years ago • 4 comments

It seems that punctuation symbols are not correctly separated.

>>> simple_tokenizer('"test": "that"')
['"', 'test', '":', '"', 'that', '"']

I would have expected: ['"', 'test', '"', ':', '"', 'that', '"']
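For reference, the expected behavior (each punctuation character as its own token) can be sketched with a simple regex. This is a hypothetical alternative for illustration, not simplemma's actual implementation:

```python
import re

# Hypothetical tokenizer: a token is either a run of word characters
# or a single punctuation character (whitespace is skipped).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def split_punct_tokenizer(text):
    return TOKEN_RE.findall(text)

print(split_punct_tokenizer('"test": "that"'))
# → ['"', 'test', '"', ':', '"', 'that', '"']
```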

P.S.: I would rather you review my big refactor before applying any other changes, to avoid constant conflicts in the PR 🙂

juanjoDiaz avatar Jan 19 '23 13:01 juanjoDiaz

Thanks for the feedback!

The tokenizer does something slightly different from what is usually expected: it clusters characters together while segmenting the input. Since the output only consists of lemmata, the idea is to keep it simple and group punctuation signs, because they're not relevant in this case.

Maybe the name could be changed (word tokenizer?); in any case, this behavior should be documented.
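The clustering behavior described above can be approximated like this (a sketch for illustration, not simplemma's actual regex): runs of word characters and runs of punctuation each become a single token, which reproduces the output reported in the issue.

```python
import re

# Approximation of the clustering tokenizer: consecutive punctuation
# characters are grouped into one token instead of being split apart.
CLUSTER_RE = re.compile(r"\w+|[^\w\s]+")

def cluster_tokenizer(text):
    return CLUSTER_RE.findall(text)

print(cluster_tokenizer('"test": "that"'))
# → ['"', 'test', '":', '"', 'that', '"']
```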

adbar avatar Jan 19 '23 17:01 adbar

Yes, your PR has the priority now!

adbar avatar Jan 19 '23 17:01 adbar

Hi @adbar,

Coming back to this. Is there any reason to group symbols? Performance or something else? Or was it just because it's simpler and does the job, since symbols are ignored anyway?

juanjoDiaz avatar May 12 '23 17:05 juanjoDiaz

Yes, it's faster and simpler. Otherwise you would have to tokenize punctuation accurately (which is a different task) and run the lemmatizer on it (which is useless in the current context and only means additional processing time).
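The trade-off can be illustrated with the two hypothetical regexes from above (again, not simplemma's actual implementation): splitting punctuation character by character produces more tokens, and each extra token would be another useless pass through the lemmatizer.

```python
import re

CLUSTER_RE = re.compile(r"\w+|[^\w\s]+")  # groups punctuation runs
SPLIT_RE = re.compile(r"\w+|[^\w\s]")     # one token per punctuation char

text = '"test": "that", "and": "more"' * 1000

# Splitting yields strictly more tokens on punctuation-heavy input,
# i.e. more downstream lemmatizer calls for no gain.
print(len(CLUSTER_RE.findall(text)))
print(len(SPLIT_RE.findall(text)))
```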

adbar avatar May 12 '23 19:05 adbar