pisa
PISA: Performant Indexes and Search for Academia
Disk-streamed indexing can result in non-deterministic indexes if there is insufficient /tmp storage
**Describe the bug** As the title suggests, I had some issues with segfaults upon mmapping an index. It turned out that the index was being built (but incorrectly) because there...
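One way to catch this class of problem early is a pre-flight check on the scratch directory before the build starts. The following is only a minimal sketch using the C++17 standard library; `has_free_space` and its parameters are hypothetical names for illustration and are not part of PISA.

```cpp
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical helper (not part of PISA): returns true only if `dir` can be
// queried and reports at least `required_bytes` of available space.
inline bool has_free_space(const std::filesystem::path& dir,
                           std::uintmax_t required_bytes)
{
    std::error_code ec;
    // The error_code overload does not throw; on failure `ec` is set.
    const std::filesystem::space_info info = std::filesystem::space(dir, ec);
    if (ec) {
        return false;  // Treat "cannot query" as "not enough space".
    }
    return info.available >= required_bytes;
}
```

A caller could pass `std::filesystem::temp_directory_path()` and an estimate of the intermediate data size. Such a check does not replace verifying that every write succeeded, since space can still run out while the build is in progress.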
The following function has issues: https://github.com/pisa-engine/pisa/blob/c6481af140224070aaa5d7ec109bbde396268b8c/include/pisa/bit_vector.hpp#L285
1. We cast an arbitrary byte pointer to int, making it UB.
2. We assume the bit vector has enough bytes to actually dereference....
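Both concerns (the pointer cast and the unchecked length) can be illustrated with a small, generic sketch. This is not the PISA function linked above; `read_u32` is a hypothetical helper that assumes a 32-bit read from a raw byte buffer, copying the bytes with `std::memcpy` instead of casting the pointer and checking bounds first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper (not part of PISA): reads a 32-bit value from a byte
// buffer without casting the byte pointer to an integer pointer.
inline std::optional<std::uint32_t>
read_u32(const std::uint8_t* data, std::size_t size, std::size_t offset)
{
    // Refuse to read past the end of the buffer.
    if (offset > size || size - offset < sizeof(std::uint32_t)) {
        return std::nullopt;
    }
    std::uint32_t value;
    // memcpy is well defined for any alignment and avoids aliasing issues.
    std::memcpy(&value, data + offset, sizeof(value));
    return value;
}
```

Compilers typically lower a fixed-size `std::memcpy` like this to a single load, so the safe version should not need to cost more than the cast, though that would have to be confirmed by benchmarking.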
Because the implementation of block-wise `decode` (and `encode`, though that one is less crucial) is moved out of the header, there is a risk that this will affect performance....
Opening another PR because I don't know how easy/tricky this will turn out to be.
Dear friends, first, thank you all for the great project! This search engine is the fanciest I've found on GitHub! In our case, we will have...
For some weird reason, reordering by URL does not work when using https://github.com/pisa-engine/pisa/blob/master/tools/reorder_docids.cpp. It does work if we use this external script instead: https://github.com/pisa-engine/pisa/blob/master/script/generate_sorted_docids_mapping.py
**Describe the solution you'd like** For CJK languages, such as Chinese, words are not separated by spaces, so there is usually a need to use a tokenizer to split...
Below is a rough draft of a schema/config/meta file (idk what name fits best here) to organize the files together. The primary goal is to have sane defaults such that...
Currently, term weighting is handled within the `Cursors` classes. In particular, the `ScoredCursor` class stores the query term weight (the weight assigned to a term at query time, usually set...
**Describe the bug** The docs at https://pisa.readthedocs.io/en/latest/compress_index.html#usage talk about using `create_freq_index` which is what the binary was called in `ds2i`. It looks like it was renamed to `compress_inverted_index` and additional...