Improving the understanding of what PISA is

Open JMMackenzie opened this issue 4 years ago • 13 comments

Given recent feedback from HN, we should look at improving how we explain PISA, and offer some benchmarks to common systems like Lucene and Tantivy (perhaps).

We also should document some things such as:

  • Use cases
  • Assumptions (in memory)
  • Target audience and why you would want to use it
  • Limitations
  • Algorithms implemented (in terms of the basics, i.e. top-k search, Boolean matching, etc.)
  • Scale (some numbers about the collection sizes we can index and search, and some basic timings/sizes)

This is probably something we can build incrementally into the README or an ABOUT page.

JMMackenzie avatar Mar 16 '20 23:03 JMMackenzie

Hello. I maintain tantivy.

We have a simple benchmark that makes it possible to measure your engine against tantivy and lucene.

https://github.com/tantivy-search/search-benchmark-game

fulmicoton avatar Mar 25 '20 13:03 fulmicoton

@fulmicoton Thanks! I noticed that many "union" type of queries are now faster with Lucene. Is that due to their recent implementation of BMW? Do you know? Is that a recent change?

elshize avatar Mar 25 '20 15:03 elshize

Yes, this is due to Block-Max WAND (BMW).

You can see them in the bench if you select the top 10 collector and union queries. https://tantivy-search.github.io/bench/

Lucene >=8 is >2x faster than tantivy at those queries.

It was the other way around for Lucene <8.

fulmicoton avatar Mar 26 '20 00:03 fulmicoton

Thanks to the awesome work of @amallia and @elshize (and the help of @fulmicoton), we now feature in the Tantivy benchmark game: https://github.com/tantivy-search/search-benchmark-game

This is one step towards closing off this issue :-)

JMMackenzie avatar Apr 02 '20 21:04 JMMackenzie

Hi (@elshize @amallia @gustingonzalez)

Just added this: https://github.com/pisa-engine/pisa/blob/describe-pisa/ABOUT.md

Could you take a look and let me know what you think is missing/needs changing?

JMMackenzie avatar Apr 07 '20 01:04 JMMackenzie

It looks good. Nothing else to add comes to mind at the moment. You wrote: $50$ million, so make sure to remove the $ signs, but other than that, it looks solid.

elshize avatar Apr 07 '20 01:04 elshize

> It looks good. Nothing else to add comes to mind at the moment. You wrote: $50$ million, so make sure to remove the $ signs, but other than that, it looks solid.

Whoops, too much TeX haha. Fixed, thanks.

JMMackenzie avatar Apr 07 '20 02:04 JMMackenzie

Hello everyone! Looks good. Maybe, in the first paragraph, it could be useful to define the inverted index concept as the logical representation of a corpus.

gustingonzalez avatar Apr 07 '20 04:04 gustingonzalez

> Hello everyone! Looks good. Maybe, in the first paragraph, it could be useful to define the inverted index concept as the logical representation of a corpus.

Thanks, good suggestion. I included that and a link to Wikipedia's relevant area.

JMMackenzie avatar Apr 07 '20 04:04 JMMackenzie

Let's keep this open for now as a WIP; we can add some library examples as well. The first step has been done via #359.

JMMackenzie avatar Apr 14 '20 11:04 JMMackenzie

Shall we add a list of papers that use PISA? Here is my list. Are there any others? @elshize

  • https://dl.acm.org/doi/abs/10.1145/3373376.3378521
  • https://dl.acm.org/doi/abs/10.1145/3345001
  • https://dl.acm.org/doi/abs/10.1145/3331184.3331207
  • https://link.springer.com/chapter/10.1007/978-3-030-15712-8_23
  • https://dl.acm.org/doi/abs/10.14778/3384345.3384358
  • https://arxiv.org/abs/2003.08276
  • https://link.springer.com/chapter/10.1007/978-3-030-15712-8_52

amallia avatar Jun 01 '20 15:06 amallia

Nothing else comes to mind. @JMMackenzie ?

elshize avatar Jun 01 '20 17:06 elshize

Once it becomes available, this one: https://jmmackenzie.io/publication/sigir20-short/

Also the repro bisection paper: https://link.springer.com/chapter/10.1007/978-3-030-15712-8_22

There could also be a few more from SIGIR from other groups, but we'll have to wait and see. I usually keep an eye on arXiv but haven't seen anyone using it recently.

JMMackenzie avatar Jun 01 '20 22:06 JMMackenzie