pisa Improving the understanding of what PISA is

Given recent feedback from HN, we should look at improving how we explain PISA, and perhaps offer some benchmarks against common systems like Lucene and Tantivy.

We should also document things such as:

- Use cases
- Assumptions (in memory)
- Target audience and why you would want to use it
- Limitations
- Algorithms implemented (in terms of the basics, i.e. top-k search, Boolean matching, etc.)
- Scale (some numbers about the collection sizes we can index and search, and some basic timings/sizes)

This is probably something we can build incrementally into the README or an ABOUT page.

Mar 16 '20 23:03 JMMackenzie

Hello. I maintain tantivy.

We have a simple benchmark that makes it possible to measure your engine against tantivy and lucene.

https://github.com/tantivy-search/search-benchmark-game

Mar 25 '20 13:03 fulmicoton

@fulmicoton Thanks! I noticed that many "union"-type queries are now faster with Lucene. Do you know if that is due to their recent implementation of BMW? Is that a recent change?

Mar 25 '20 15:03 elshize

Yes, this is due to Block-Max WAND (BMW).

You can see this in the benchmark if you select the top-10 collector and union queries: https://tantivy-search.github.io/bench/

Lucene >=8 is >2x faster than tantivy at those queries.

It was the other way around for Lucene <8.

Mar 26 '20 00:03 fulmicoton

Thanks to the awesome work of @amallia and @elshize (and the help of @fulmicoton), we now feature in the Tantivy benchmark game: https://github.com/tantivy-search/search-benchmark-game

This is one step towards closing off this issue :-)

Apr 02 '20 21:04 JMMackenzie

Hi (@elshize @amallia @gustingonzalez)

Just added this: https://github.com/pisa-engine/pisa/blob/describe-pisa/ABOUT.md

Could you take a look and let me know what you think is missing/needs changing?

Apr 07 '20 01:04 JMMackenzie

It looks good. Nothing else to add comes to mind at the moment. You wrote $50$ million, so make sure to remove the $ signs, but other than that, it looks solid.

Apr 07 '20 01:04 elshize

> It looks good. Nothing else to add comes to mind at the moment. You wrote $50$ million, so make sure to remove the $ signs, but other than that, it looks solid.

Whoops, too much TeX haha. Fixed, thanks.

Apr 07 '20 02:04 JMMackenzie

Hello everyone! Looks good. Maybe, in the first paragraph, it would be useful to define the inverted index concept as the logical representation of a corpus.

Apr 07 '20 04:04 gustingonzalez

> Hello everyone! Looks good. Maybe, in the first paragraph, it would be useful to define the inverted index concept as the logical representation of a corpus.

Thanks, good suggestion. I included that and a link to the relevant Wikipedia article.

Apr 07 '20 04:04 JMMackenzie

Let's keep this open for now as a WIP; we can add some library examples as well. The first step has been done via #359.

Apr 14 '20 11:04 JMMackenzie

Shall we add a list of papers that use PISA? Here is my list; are there any others? @elshize

https://dl.acm.org/doi/abs/10.1145/3373376.3378521
https://dl.acm.org/doi/abs/10.1145/3345001
https://dl.acm.org/doi/abs/10.1145/3331184.3331207
https://link.springer.com/chapter/10.1007/978-3-030-15712-8_23
https://dl.acm.org/doi/abs/10.14778/3384345.3384358
https://arxiv.org/abs/2003.08276
https://link.springer.com/chapter/10.1007/978-3-030-15712-8_52

Jun 01 '20 15:06 amallia

Nothing else comes to mind. @JMMackenzie ?

Jun 01 '20 17:06 elshize

This one, when it becomes available: https://jmmackenzie.io/publication/sigir20-short/

Also the repro bisection paper: https://link.springer.com/chapter/10.1007/978-3-030-15712-8_22

There could also be a few more from SIGIR from other groups, but we'll have to wait and see. I usually keep an eye on arXiv but haven't seen anyone using it recently.

Jun 01 '20 22:06 JMMackenzie
