tuning of default search scoring for wildcard searches

Open andrewsu opened this issue 6 years ago • 3 comments

I believe we have scoring in place to prioritize human, mouse, and rat over other species, and that seems to be working well:

https://mygene.info/v3/query?q=BRCA2

But scoring for wildcard searches is far from perfect, e.g.:

https://mygene.info/v3/query?q=BRCA*&fields=symbol,name,alias

{
  "max_score": 1.55,
  "took": 15,
  "total": 5502,
  "hits": [
    {
      "_id": "106721785",
      "_score": 1.55,
      "name": "BRCA2 promoter\/silencer region",
      "symbol": "LOC106721785"
    },
    {
      "_id": "79184",
      "_score": 1.55,
      "alias": [
        "BRCC36",
        "C6.1A",
        "CXorf53"
      ],
      "name": "BRCA1\/BRCA2-containing complex subunit 3",
      "symbol": "BRCC3"
    },
    {
      "_id": "11200",
      "_score": 1.55,
      "alias": [
        "CDS1",
        "CHK2",
        "HuCds1",
        "LFS2",
        "PP1425",
        "RAD53",
        "hCds1"
      ],
      "name": "checkpoint kinase 2",
      "symbol": "CHEK2"
    },
    {
      "_id": "56647",
      "_score": 1.55,
      "alias": [
        "TOK-1",
        "TOK1"
      ],
      "name": "BRCA2 and CDKN1A interacting protein",
      "symbol": "BCCIP"
    },
    {
      "_id": "5932",
      "_score": 1.55,
      "alias": [
        "COM1",
        "CTIP",
        "JWDS",
        "RIM",
        "SAE2",
        "SCKL2"
      ],
      "name": "RB binding protein 8, endonuclease",
      "symbol": "RBBP8"
    },
    {
      "_id": "672",
      "_score": 1.55,
      "alias": [
        "BRCAI",
        "BRCC1",
        "BROVCA1",
        "FANCS",
        "IRIS",
        "PNCA4",
        "PPP1R53",
        "PSCP",
        "RNF53"
      ],
      "name": "BRCA1, DNA repair associated",
      "symbol": "BRCA1"
    },
    {
      "_id": "111589216",
      "_score": 1.55,
      "name": "BRCA1 intron 2 regulatory region",
      "symbol": "LOC111589216"
    },
    {
      "_id": "1845",
      "_score": 1.55,
      "alias": "VHR",
      "name": "dual specificity phosphatase 3",
      "symbol": "DUSP3"
    },
    {
      "_id": "57697",
      "_score": 1.55,
      "alias": [
        "FAAP250",
        "KIAA1596"
      ],
      "name": "Fanconi anemia complementation group M",
      "symbol": "FANCM"
    },
    {
      "_id": "29086",
      "_score": 1.55,
      "alias": [
        "C19orf62",
        "HSPC142",
        "MERIT40",
        "NBA1"
      ],
      "name": "BRISC and BRCA1 A complex member 1",
      "symbol": "BABAM1"
    }
  ]
}

In my mind, the scoring should preferentially weight matches to symbol (e.g., BRCA1 and BRCA2), then aliases (e.g., LINC01488/BRCAT8 and LINC02224/BRCAT107, both not shown here in the top 10), then name (e.g., BRCC3 and BCCIP), then any other field (e.g., CHEK2, RBBP8).
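Until scoring like this exists on the server, the proposed priority can be approximated on the client. The sketch below is only an illustration of that ordering, not how mygene.info scores hits: `field_rank` and `rerank` are hypothetical helper names, the wildcard is matched token by token with Python's `fnmatch`, and only the fields returned by `fields=symbol,name,alias` are inspected, so a hit that matched on any other field falls into the last tier.

```python
import re
from fnmatch import fnmatchcase


def _tokens(value):
    """Lower-case tokens from a string or a list of strings (alias may be either)."""
    values = value if isinstance(value, list) else [value]
    tokens = []
    for v in values:
        if v:
            tokens.extend(t for t in re.split(r"[\s/,]+", str(v).lower()) if t)
    return tokens


def field_rank(hit, pattern):
    """Hypothetical tiering: 0 = symbol match, 1 = alias, 2 = name, 3 = other field."""
    pat = pattern.lower()
    for rank, field in enumerate(("symbol", "alias", "name")):
        if any(fnmatchcase(tok, pat) for tok in _tokens(hit.get(field))):
            return rank
    return 3


def rerank(hits, pattern):
    """Sort by tier first, then by the server's _score (highest first).

    sorted() is stable, so hits that tie keep their original order.
    """
    return sorted(hits, key=lambda h: (field_rank(h, pattern), -h.get("_score", 0)))
```

Applied to the ten hits above with the pattern `BRCA*`, this would move BRCA1 ahead of the name-only matches such as BRCC3 and BCCIP, and push CHEK2 and RBBP8 to the end. It only reorders the page that was fetched, so symbol matches that are not in the returned hits still need a server-side fix.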

More info and context: https://github.com/cognoma/frontend/issues/169

andrewsu, Apr 04 '18 16:04

Also note that some people (e.g., our cognoma friends) may be relying on the existing scoring scheme in production applications, so we should consult with them before pushing major changes to it.

andrewsu, Apr 04 '18 16:04

@andrewsu another option for the folks at cognoma is to use the "userquery" option, which lets them get exactly what they want from our db by defining their own query. As a quick example, I implemented a parameterized weighting on top of our existing wildcard query (which has no weighting, hence the results you describe). This query template is in our biothings.userqueries repo (which they can submit a pull request to), here:

https://github.com/biothings/biothings.userqueries/blob/production/mygene/weighted_wildcard/query.txt

In addition to the query search term "q", this template takes 3 extra parameters, one for the weight of each field the wildcard query searches (symbol, name, summary). To use these, they must be prefixed with "uq_" in the URL string. An example of how this might work for brca* (using symbol weight 2, name weight 1, and summary weight 0.5) is shown below.

http://mygene.info/v3/query?q=brca*&userquery=weighted_wildcard&uq_symbol_weight=2&uq_name_weight=1&uq_summary_weight=0.5
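For anyone assembling that request in code, here is a minimal sketch that builds the same query string with Python's standard library. The endpoint and the parameter names (`userquery`, `uq_symbol_weight`, `uq_name_weight`, `uq_summary_weight`) are taken from the example URL above; the helper name `weighted_wildcard_url` and its defaults are only illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "https://mygene.info/v3/query"


def weighted_wildcard_url(term, symbol_weight=2, name_weight=1, summary_weight=0.5):
    """Build a query URL for the weighted_wildcard userquery (hypothetical helper)."""
    params = {
        "q": term,
        "userquery": "weighted_wildcard",
        # Template parameters must carry the "uq_" prefix in the URL.
        "uq_symbol_weight": symbol_weight,
        "uq_name_weight": name_weight,
        "uq_summary_weight": summary_weight,
    }
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


print(weighted_wildcard_url("brca*"))
```

Changing the three weights is then a matter of passing different keyword arguments, which makes it easy to compare rankings across a few weightings before settling on one.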

cyrus0824, Apr 04 '18 17:04

I do think we should weight our wildcard searches natively, however.

But for use cases where their intention diverges from our own, allowing them to define their own queries could be a useful alternative.

cyrus0824, Apr 04 '18 17:04