Seeking a good search engine for PhysioNet

Open bemoody opened this issue 7 months ago • 3 comments

The current PhysioNet search function is not great (previous issues: #349, #1971). We would like to replace it with something based on a "real" information-retrieval engine, while also allowing more powerful and user-friendly queries.

There are a few options; in this issue I'll try to list the advantages and disadvantages of each.

Requirements:

  • Free and open-source software
  • Reasonable security support

Good to have:

  • Django integration - Haystack (https://haystacksearch.org/), for example, makes it easy to index and search objects in the Django ORM
  • Language support - PhysioNet only publishes projects written in English, but we would like the platform to be international
  • Exact word searching - ability to search for a term without stemming or synonyms (often written +foo or "foo")
  • Phrase searching - ability to search for an exact phrase ("foo bar")
  • Range queries - e.g. "projects published between 2021-06-01 and 2021-09-01"
  • Faceting - e.g. "list the distinct authors of matching projects and the number of matching projects for each author"
  • Collapsing - e.g. "search for published projects matching the query, then list distinct core projects ordered by relevance"
  • Synonyms - e.g. treating ecg and electrocardiogram as equivalent
  • User-friendly query parser - if the query parser supports complex syntax, it should provide diagnostics so you can understand why your query isn't working
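
To make the exact-word, phrase, and diagnostics items a bit more concrete, here is a minimal Python sketch of what a friendly front-end parser might do before handing a query to whichever engine is chosen. Everything in it is hypothetical (the `parse_query` function, the `ParsedQuery` class, the warning text); none of it comes from physionet-build or from the engines listed in this issue, each of which has its own query parser.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token syntax: a "quoted phrase", a +exact word, or a plain word.
TOKEN_RE = re.compile(r'"([^"]*)"|\+([^\s"]+)|([^\s"]+)')


@dataclass
class ParsedQuery:
    phrases: list = field(default_factory=list)   # word lists to match in order
    exact: list = field(default_factory=list)     # words to match without stemming/synonyms
    terms: list = field(default_factory=list)     # ordinary words (may be stemmed/expanded)
    warnings: list = field(default_factory=list)  # diagnostics to show the user


def parse_query(text):
    """Split a raw query string into phrases, exact words, and plain terms."""
    parsed = ParsedQuery()

    # Diagnostic: an odd number of quotes means one quote has no partner.
    # Rather than failing, ignore the last quote and tell the user.
    if text.count('"') % 2:
        last = text.rfind('"')
        text = text[:last] + " " + text[last + 1:]
        parsed.warnings.append("unbalanced double quote ignored")

    for match in TOKEN_RE.finditer(text):
        phrase, exact, term = match.groups()
        if phrase is not None:
            words = phrase.lower().split()
            if len(words) > 1:
                parsed.phrases.append(words)
            elif len(words) == 1:
                # A quoted single word is treated like +word.
                parsed.exact.append(words[0])
            else:
                parsed.warnings.append("empty quoted phrase ignored")
        elif exact is not None:
            parsed.exact.append(exact.lower())
        else:
            parsed.terms.append(term.lower())
    return parsed


if __name__ == "__main__":
    print(parse_query('+ecg "atrial fibrillation" sleep'))
    print(parse_query('"ecg'))
```

The point of the sketch is only that a malformed query still yields a usable search plus a message explaining what was ignored, instead of an error or an empty result page.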

Some options we might consider:

  • Xapian
  • Whoosh
  • Solr
  • OpenSearch
  • Manticore
  • PostgreSQL

bemoody, Jan 18 '24 19:01