pisa
Tokenization in parse_plaintext_content
Describe the bug
Shouldn't we do proper tokenization in parse_plaintext_content too?
https://github.com/pisa-engine/pisa/blob/4a739b2ec50d2faa1e3c57336337e4fe219e09ec/include/pisa/forward_index_builder.hpp#L60-L66
@elshize what do you think about this?
We probably should. Historically, we used this function when we already had a stream of tokens coming in, which is probably why it is written this way. But yes, it should be fixed.
This was done when text analysis was revamped: this function now uses EnglishTokenizer, the same as for HTML documents.
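For readers unfamiliar with the distinction raised in this issue, here is a minimal sketch of the two behaviors. It is not PISA's code: it assumes the old behavior amounted to splitting on whitespace (reasonable when the input is already a stream of tokens), and it stands in a simple alphanumeric-run tokenizer for "proper tokenization". The names split_whitespace and tokenize_alnum are hypothetical, and the real EnglishTokenizer may apply different rules.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of PISA): treats the input as an already
// tokenized stream and only splits on whitespace, so punctuation stays
// attached to the terms.
std::vector<std::string> split_whitespace(const std::string& content) {
    std::istringstream stream(content);
    std::vector<std::string> terms;
    std::string term;
    while (stream >> term) {
        terms.push_back(term);
    }
    return terms;
}

// Hypothetical helper (not part of PISA): emits maximal runs of ASCII
// alphanumeric characters and drops everything else. This is a simplified
// stand-in for a real tokenizer, used only to show the difference.
std::vector<std::string> tokenize_alnum(const std::string& content) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : content) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        terms.push_back(current);
    }
    return terms;
}
```

For the input "Hello, world!", split_whitespace yields the terms "Hello," and "world!", while tokenize_alnum yields "Hello" and "world". With whitespace splitting, plaintext and HTML inputs could produce different terms for the same text, which is the inconsistency this issue is about.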