Tokenization in parse_plaintext_content

Open amallia opened this issue 5 years ago • 2 comments

Describe the bug

Shouldn't we do proper tokenization in parse_plaintext_content too?

https://github.com/pisa-engine/pisa/blob/4a739b2ec50d2faa1e3c57336337e4fe219e09ec/include/pisa/forward_index_builder.hpp#L60-L66

amallia avatar Jul 31 '19 14:07 amallia

@elshize what do you think about this?

amallia avatar Jan 09 '20 19:01 amallia

We probably should. Historically, we used this when we already had a stream of tokens coming in, which is probably why it's like this. But yeah, it should be fixed.

elshize avatar Jan 15 '20 17:01 elshize
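
To illustrate the distinction discussed above, here is a minimal sketch. It assumes the plaintext path simply split its input on whitespace (reasonable for an already tokenized stream), and it contrasts that with a very rough tokenizer that breaks on any non-alphanumeric character. Both function names are hypothetical, and neither is PISA's actual code or its tokenizer.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split on whitespace only, which is enough when the
// input is already a stream of tokens. Punctuation stays glued to the terms.
std::vector<std::string> split_on_whitespace(const std::string& content) {
    std::istringstream stream(content);
    std::vector<std::string> terms;
    std::string term;
    while (stream >> term) {
        terms.push_back(term);
    }
    return terms;
}

// Hypothetical stand-in for a real tokenizer (an assumption, not PISA's):
// emit maximal runs of alphanumeric characters and drop everything else.
std::vector<std::string> split_on_non_alnum(const std::string& content) {
    std::vector<std::string> terms;
    std::string term;
    for (unsigned char c : content) {
        if (std::isalnum(c)) {
            term.push_back(static_cast<char>(c));
        } else if (!term.empty()) {
            terms.push_back(term);
            term.clear();
        }
    }
    if (!term.empty()) {
        terms.push_back(term);
    }
    return terms;
}
```

On raw text such as `Hello, world! foo-bar`, the whitespace split keeps `Hello,` and `world!` as terms, while the second function yields `Hello`, `world`, `foo`, `bar`. That mismatch is why plaintext and HTML input would otherwise be indexed with different vocabularies.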

This was done as part of revamping text analysis. This function now uses EnglishTokenizer, the same as for HTML documents.

elshize avatar Feb 09 '23 02:02 elshize