pisa
Tokenization in parse_plaintext_content
Describe the bug
Shouldn't we do proper tokenization in parse_plaintext_content too?
https://github.com/pisa-engine/pisa/blob/4a739b2ec50d2faa1e3c57336337e4fe219e09ec/include/pisa/forward_index_builder.hpp#L60-L66
@elshize what do you think about this?
We probably should. Historically, we used this function when we already had a stream of tokens coming in, which is probably why it's like this. But yeah, it should be fixed.
This was done when revamping text analysis: this function now uses EnglishTokenizer, the same as for HTML documents.
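For readers landing on this thread, here is a minimal sketch of the difference under discussion. It assumes the older behavior amounted to splitting on whitespace, which fits the "stream of tokens" description above. Both helper names are hypothetical, and the second function is only a simple stand-in for a real tokenizer. It is not PISA's EnglishTokenizer, and neither function is part of the PISA API.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace only, so punctuation stays
// attached to the neighboring word.
std::vector<std::string> split_whitespace(const std::string& content) {
    std::istringstream stream(content);
    std::vector<std::string> terms;
    std::string term;
    while (stream >> term) {
        terms.push_back(term);
    }
    return terms;
}

// Hypothetical helper: emits maximal runs of ASCII alphanumeric characters
// and treats everything else as a separator.
std::vector<std::string> split_alphanumeric(const std::string& content) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : content) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        terms.push_back(current);
    }
    return terms;
}
```

With the input `Hello, world!`, the whitespace split keeps the punctuation (`Hello,` and `world!`), while the alphanumeric split yields `Hello` and `world`. If plain-text and HTML documents are tokenized differently, the same word can end up as different terms in the index.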