pisa
Improving the understanding of what PISA is
Given recent feedback from HN, we should look at improving how we explain PISA, and offer some benchmarks against common systems such as Lucene and (perhaps) Tantivy.
We also should document some things such as:
- Use cases
- Assumptions (in memory)
- Target audience and why you would want to use it
- Limitations
- Algorithms implemented (in terms of the basics, i.e., top-k search, Boolean matching, etc.)
- Scale (some numbers about the collection sizes we can index and search, and some basic timings/sizes)
This is probably something we can build incrementally into the README or an ABOUT page.
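To make the "algorithms implemented" item a bit more concrete, here is a minimal Python sketch of Boolean conjunctive matching and top-k retrieval over a toy in-memory inverted index. It is purely illustrative and is not PISA's API: the postings, the function names, and the use of summed term frequency as a score (standing in for a real ranking function) are all assumptions made for the example.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> list of (doc_id, term_frequency), sorted by doc_id.
INDEX = {
    "search": [(0, 2), (1, 1), (3, 4)],
    "engine": [(0, 1), (3, 1), (4, 2)],
    "index": [(1, 3), (3, 2)],
}


def boolean_and(index, terms):
    """Return the sorted doc IDs that contain every query term."""
    doc_sets = [{doc for doc, _ in index.get(t, [])} for t in terms]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))


def top_k_union(index, terms, k):
    """Score docs containing any query term by summed term frequency
    (an assumed stand-in for a real scorer) and return the k best
    as (doc_id, score) pairs."""
    scores = defaultdict(int)
    for t in terms:
        for doc, tf in index.get(t, []):
            scores[doc] += tf
    # Highest score first; ties broken by lower doc ID.
    return heapq.nsmallest(k, scores.items(), key=lambda x: (-x[1], x[0]))
```

A real engine avoids scoring every posting exhaustively like this; dynamic pruning is what makes top-k queries fast at scale, but the inputs and outputs have the same shape.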
Hello. I maintain tantivy.
We have a simple benchmark that makes it possible to measure your engine against tantivy and lucene.
https://github.com/tantivy-search/search-benchmark-game
@fulmicoton Thanks! I noticed that many "union" type of queries are now faster with Lucene. Is that due to their recent implementation of BMW? Do you know? Is that a recent change?
Yes, this is due to Block-Max WAND (BMW).
You can see them in the bench if you select the top 10 collector and union queries. https://tantivy-search.github.io/bench/
Lucene >=8 is >2x faster than tantivy at those queries.
It was the other way around for Lucene <8.
Thanks to the awesome work of @amallia and @elshize (and the help of @fulmicoton ) we now feature in the Tantivy benchmark game: https://github.com/tantivy-search/search-benchmark-game
This is one step towards closing off this issue :-)
Hi (@elshize @amallia @gustingonzalez)
Just added this: https://github.com/pisa-engine/pisa/blob/describe-pisa/ABOUT.md
Could you take a look and let me know what you think is missing/needs changing?
It looks good. Nothing else to add comes to mind at the moment. You wrote "$50$ million", so make sure to remove the $ signs, but other than that, it looks solid.
Whoops, too much TeX haha. Fixed, thanks.
Hello everyone! Looks good. Maybe, in the first paragraph, it would be useful to define the inverted index concept as the logical representation of a corpus.
Thanks, good suggestion. I included that and a link to Wikipedia's relevant area.
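For readers new to the term, here is a small Python sketch of what "an inverted index as the logical representation of a corpus" means: each term maps to the documents that contain it. The corpus, the function name, and the lowercase-and-split tokenization are assumptions made for illustration; this says nothing about how PISA actually parses or stores a collection.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumed tokenization: lowercase, split on whitespace.
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Hypothetical three-document corpus.
corpus = ["the quick brown fox", "the lazy dog", "the fox and the dog"]
index = build_inverted_index(corpus)
```

Looking up `index["fox"]` then gives the postings for that term without scanning the documents themselves.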
Let's keep this open for now as a WIP; we can add some library examples as well. The first step has been done via #359.
Shall we add a list of papers that use PISA? Here is my list; are there any others? @elshize
- https://dl.acm.org/doi/abs/10.1145/3373376.3378521
- https://dl.acm.org/doi/abs/10.1145/3345001
- https://dl.acm.org/doi/abs/10.1145/3331184.3331207
- https://link.springer.com/chapter/10.1007/978-3-030-15712-8_23
- https://dl.acm.org/doi/abs/10.14778/3384345.3384358
- https://arxiv.org/abs/2003.08276
- https://link.springer.com/chapter/10.1007/978-3-030-15712-8_52
Nothing else comes to mind. @JMMackenzie ?
This one, once it becomes available: https://jmmackenzie.io/publication/sigir20-short/
Also the repro bisection paper: https://link.springer.com/chapter/10.1007/978-3-030-15712-8_22
There could also be a few more from SIGIR from other groups, but we'll have to wait and see. I usually keep an eye on arXiv but haven't seen anyone using it recently.