
Include taxon id with taxon label in facet count of entity search endpoint

Open vincerubinetti opened this issue 2 years ago • 5 comments

I'm developing the 3.0 version of the monarch ui/website, and I've run into a limitation. @putmantime

Here is an example response from the /search/entity/{term} endpoint, searching "ssh":

{
  "numFound": 177,
  "docs": [
    {
      "id": "FlyBase:FBgn0029157",
      "id_std": "FlyBase:FBgn0029157",
      "id_eng": "FlyBase:FBgn0029157",
      "id_kw": "FlyBase:FBgn0029157",
      "prefix": "FlyBase",
      "label": ["ssh"],
      "label_std": ["ssh"],
      "label_eng": ["ssh"],
      "label_kw": ["ssh"],
      "edges": 319,
      "taxon": "NCBITaxon:7227",
      "taxon_std": "NCBITaxon:7227",
      "taxon_eng": "NCBITaxon:7227",
      "taxon_kw": "NCBITaxon:7227",
      "taxon_label": "Drosophila melanogaster",
      "taxon_label_std": "Drosophila melanogaster",
      "taxon_label_eng": "Drosophila melanogaster",
      "taxon_label_kw": "Drosophila melanogaster",
      "taxon_label_synonym": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_std": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_eng": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_kw": ["fruit fly", "Sophophora melanogaster"],
      "has_phenotype": false,
      "category": ["gene", "sequence feature"],
      "category_std": ["gene", "sequence feature"],
      "category_eng": ["gene", "sequence feature"],
      "category_kw": ["gene", "sequence feature"],
      "synonym": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_std": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_eng": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_kw": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "equivalent_curie": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_std": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_eng": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_kw": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "leaf": true,
      "_version_": 1696524917734899700,
      "score": 117.35552
    }
  ],
  "facet_counts": {
    "category": {
    },
    "taxon_label": {
      "Sus scrofa": 25,
      "Drosophila melanogaster": 21,
      "Homo sapiens": 18,
      "Mus musculus": 16,
      "Bos taurus": 6,
      "Saccharomyces cerevisiae S288C": 6,
      "Xenopus tropicalis": 6,
      "Danio rerio": 5,
      "Gallus gallus": 4,
      "Anolis carolinensis": 3,
      "Canis lupus familiaris": 3,
      "Felis catus": 3,
      "Macaca mulatta": 3,
      "Monodelphis domestica": 3,
      "Ornithorhynchus anatinus": 3,
      "Pan troglodytes": 3,
      "Rattus norvegicus": 3,
      "Takifugu rubripes": 3,
      "Equus caballus": 2
    }
  },
  "highlighting": {}
}

Notice that taxon_label is being returned for facets, instead of taxon (id). This is nice for displaying a list of taxon facets, but not for actually filtering by them, because the endpoint only supports filtering by taxon (id), not taxon_label.

This requires the frontend to maintain a hard-coded label-to-id mapping for taxa. That duplicates information we already have in biolink, is brittle, and is likely to get out of sync.

And yes, I can look up taxon from docs by finding the corresponding taxon_label field. However, then I would need to make sure all results are in docs so I have all the mappings, and that might go beyond the max rows [per page] param.
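To make the gap concrete, here is a minimal sketch of the docs-based lookup described above. It assumes only the response shape shown in the example (`docs`, `facet_counts.taxon_label`); the function names are hypothetical and not part of biolink or the Monarch UI.

```python
def label_to_id_from_docs(response):
    """Map taxon_label -> taxon id using only the docs on the current page."""
    mapping = {}
    for doc in response.get("docs", []):
        label = doc.get("taxon_label")
        taxon = doc.get("taxon")
        if label and taxon:
            mapping[label] = taxon
    return mapping


def unmapped_facet_labels(response):
    """Return facet labels that cannot be resolved to an id from this page of docs."""
    mapping = label_to_id_from_docs(response)
    facets = response.get("facet_counts", {}).get("taxon_label", {})
    return [label for label in facets if label not in mapping]


# Trimmed-down version of the example response: one doc, several facets.
response = {
    "docs": [
        {"taxon": "NCBITaxon:7227", "taxon_label": "Drosophila melanogaster"},
    ],
    "facet_counts": {
        "taxon_label": {
            "Sus scrofa": 25,
            "Drosophila melanogaster": 21,
            "Homo sapiens": 18,
        }
    },
}

print(unmapped_facet_labels(response))  # ['Sus scrofa', 'Homo sapiens']
```

Any facet whose documents fall outside the returned page stays unresolved, which is why this approach only works if every result is fetched.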


Possible solutions:

  • Support a taxon_label filter parameter (in addition to the taxon parameter) in the search endpoint. I guess this would be most useful if it was an exact match, rather than a fuzzy match. If there are multiple taxon ids that map to the same exact taxon label, then this option wouldn't be viable.

  • Return an additional taxon field in facet_counts with all the information I need: id, label, and count. This would leave the taxon_label facet untouched so current applications using biolink don't suddenly break.

  • Have some kind of taxon_map field at the top level of the response so I can go from label to id easily. Though, I think this is pretty ugly... don't want to add a top level thing for a special exception for just one type of facet.

vincerubinetti avatar Feb 22 '22 21:02 vincerubinetti

It's not exactly what you're asking for, but would a facet structure like this work?:

"facet_counts": {
    "category": {
        "disease": 27,
        "publication": 9,
        "anatomical entity": 5,
        "cell": 5,
        "gene": 2,
        "sequence feature": 2,
        "phenotype": 1,
        "quality": 1
    },
    "taxon": {
        "NCBITaxon:9031": 1,
        "NCBITaxon:9606": 1
    },
    "taxon_label": {
        "Gallus gallus": 1,
        "Homo sapiens": 1
    },
    "_taxon_map": {
        "NCBITaxon:9031": {
            "Gallus gallus": 1
        },
        "NCBITaxon:9606": {
            "Homo sapiens": 1
        }
    }
}

Two things are different here: 1) there's a new taxon facet that groups results by taxon ID, and 2) there's a _taxon_map entry in facet_counts that groups first by taxon ID, then by taxon label, with the value being the count of both that ID and label. AFAIK there should be a one-to-one mapping between ID and label, so there'll always just be one child of the ID node, but just in case there isn't this structure will still work.

If so, I have this implemented in my fork of the ontobio library -- here's where the _taxon_map key is injected into the facet counts: https://github.com/falquaddoomi/ontobio/blob/92231d447a/ontobio/golr/golr_query.py#L603. I assume we'll have to figure out who downstream might be affected by this...maybe the best way is to submit a PR?
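For illustration, a client could invert the proposed `_taxon_map` into a label-to-id lookup roughly like this. This is a sketch against the structure shown above, not code from the ontobio fork; `label_to_taxon_ids` is a hypothetical helper, and it returns a list per label so it still behaves if one label ever maps to more than one ID.

```python
def label_to_taxon_ids(facet_counts):
    """Invert _taxon_map (id -> {label: count}) into label -> [ids]."""
    lookup = {}
    for taxon_id, labels in facet_counts.get("_taxon_map", {}).items():
        for label in labels:
            lookup.setdefault(label, []).append(taxon_id)
    return lookup


# The facet_counts fragment from the comment above, reduced to the new key.
facet_counts = {
    "_taxon_map": {
        "NCBITaxon:9031": {"Gallus gallus": 1},
        "NCBITaxon:9606": {"Homo sapiens": 1},
    }
}

lookup = label_to_taxon_ids(facet_counts)
print(lookup["Homo sapiens"])  # ['NCBITaxon:9606']
```

The resulting ID can then be passed to the existing taxon filter parameter, so no hard-coded mapping is needed on the frontend.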

falquaddoomi avatar Mar 01 '22 16:03 falquaddoomi

That's fine with me. If this is easier to implement or more consistent with how other things and data structures in biolink are implemented, I'd say go for it.

vincerubinetti avatar Mar 01 '22 17:03 vincerubinetti

Faisal, did you choose that structure mainly because it supports one-to-many ID-to-label mappings? I don't believe that will be the case, as we have chosen the NCBI id/label pair for a taxon.
If that's true, I think the most explicit and easily readable structure would be a list of objects, each with clear attributes: "_taxon_map": [{ "label": "Gallus gallus", "id": "NCBITaxon:9031", "count": 1 }]

But is a list of objects going to cause even more issues in this case @vincerubinetti ?

putmantime avatar Mar 01 '22 17:03 putmantime

I formatted it that way partly because I wasn't sure if there might be more than one label that matches a given taxon ID, and also because that structure kind of more closely matches how facet pivots are returned from Solr. If IDs and labels are in fact one-to-one I agree that the structure you proposed is more readable, and it's a trivial change on my end.
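As a rough sketch of why the change is small: Solr returns a pivot facet as a list of buckets with `field`, `value`, `count`, and a nested `pivot` list, and either proposed shape can be derived from that. The sample data below is made up to mirror the example above, the function names are hypothetical, and pivoting on `taxon,taxon_label` is an assumption rather than a description of what the ontobio fork does.

```python
def pivot_to_nested(pivot):
    """Build the nested shape: {taxon id: {taxon label: count}}."""
    nested = {}
    for id_bucket in pivot:
        children = id_bucket.get("pivot", [])
        nested[id_bucket["value"]] = {c["value"]: c["count"] for c in children}
    return nested


def pivot_to_flat(pivot):
    """Build the flat shape: a list of {id, label, count} objects."""
    flat = []
    for id_bucket in pivot:
        for child in id_bucket.get("pivot", []):
            flat.append(
                {
                    "id": id_bucket["value"],
                    "label": child["value"],
                    "count": child["count"],
                }
            )
    return flat


# Illustrative pivot buckets in Solr's pivot-facet response format.
solr_pivot = [
    {
        "field": "taxon",
        "value": "NCBITaxon:9031",
        "count": 1,
        "pivot": [{"field": "taxon_label", "value": "Gallus gallus", "count": 1}],
    },
    {
        "field": "taxon",
        "value": "NCBITaxon:9606",
        "count": 1,
        "pivot": [{"field": "taxon_label", "value": "Homo sapiens", "count": 1}],
    },
]

print(pivot_to_nested(solr_pivot))
print(pivot_to_flat(solr_pivot))
```

The nested form keeps working if an ID ever has several labels; the flat form is easier to read and iterate over when the mapping is one-to-one.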

falquaddoomi avatar Mar 01 '22 17:03 falquaddoomi

Let me do some research and see if I can confirm that the mapping is one-to-one. I wasn't sure what Solr typically returns, and standardizing on that might be of more value than the clarity of my proposed structure.

putmantime avatar Mar 01 '22 17:03 putmantime