ARAX returning results with predicates that are not supported by Biolink Model 3.1.2

Open edeutsch opened this issue 2 years ago • 65 comments

I looked into this very briefly so I thought I'd start an issue locally.

Related to this: https://github.com/NCATSTranslator/Feedback/issues/118

It seems like there are two possibilities:

  1. KG2.8.0 contains "gene_associated_with_condition", which apparently it should not
  2. They are querying production KG2.7.6 instead of CI KG2.8.0

The query in question appears to be:

827077 2023-02-15 19:42:47 40 ars.ci.transltr.io 52.4.10.150, 10.11.0.190 arax.ci.transltr.io arax-0 ARAX 5094 129227 ✓ Completed OK Normal completion with 24 results.

And the result is https://arax.ncats.io/?r=129227 https://arax.ncats.io/api/arax/v1.3/response/129227

As far as I can tell they are querying CI with KG2.8.0, so I think that rules out option 2. Although I am not really certain.

There is also potentially an option 3: as far as we were aware, the ask was for gene-chemical associations [image], not for gene-disease associations. So this is not yet done.

That's about all I know. Tossing this out for others more knowledgeable..

edeutsch avatar Feb 15 '23 21:02 edeutsch

Hi @edeutsch thank you for reporting this issue in the RTX project area so we can track it locally. I guess my first order of business is to try and ascertain, when a query is posted on ui.ci.transltr.io, which ARAX endpoint is queried? How would we know?

saramsey avatar Feb 17 '23 18:02 saramsey

I'm uncertain that I understand the question correctly, but the endpoint is likely to be /async_query. But could be /query. I think an easy way to know is to look at the second log entry, and if it is "asynchronous Query launching on incoming Query", then you know that the endpoint was /async_query
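That log check could be sketched roughly as follows. This assumes the response is a TRAPI-style dict whose `logs` field is a list of entries with a `message` string; the helper name `guess_endpoint` and the fallback value are hypothetical, and the marker text is the message quoted above.

```python
# Marker taken from the log message quoted above, matched case-insensitively.
ASYNC_MARKER = "asynchronous query launching"


def guess_endpoint(response: dict) -> str:
    """Guess which endpoint served a response by inspecting its second log entry.

    Hypothetical helper: assumes response["logs"] is a list of dicts,
    each with a "message" string.
    """
    logs = response.get("logs") or []
    if len(logs) < 2:
        return "unknown"  # not enough log entries to tell
    message = (logs[1].get("message") or "").lower()
    return "/async_query" if ASYNC_MARKER in message else "/query"
```

This only distinguishes the two endpoints; it says nothing about which ARAX instance or KG2 version handled the query.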

edeutsch avatar Feb 17 '23 19:02 edeutsch

Thanks @edeutsch. I didn't phrase my question well. I meant, what ARAX API base URL is being hit? How do we know that it is an ARAX installation that is configured to query RTX-KG2.8.0c?

saramsey avatar Feb 17 '23 21:02 saramsey

Well, you can see in the OP that arax.ci.transltr.io is being hit. And arax.ci.transltr.io should be hitting KG2.8.0c. But it is not trivial to be sure. I sifted through the logs to find a tell and did not come up with one. I thought the logs listed exactly which KP endpoint URLs are being hit, but I didn't see it there. Maybe I just missed it.

I just went to arax.ci.transltr.io and issued our example query and only 17 results came back, which I think is the surest sign that arax.ci.transltr.io is using KG2.8.0c right now (KG2.7.6c returns >100). But was it 2.8.0c when the initial query was sent? I'm not 100% certain. I think so, but there is room for uncertainty here. Maybe we need to consider augmenting the log messages to make it more clear.

edeutsch avatar Feb 17 '23 21:02 edeutsch

Thanks for your reply. Because I cannot remember any of these details, I took a look at this Google Sheet which seems to aim to disambiguate the various ARAX instances:

https://docs.google.com/spreadsheets/d/1eC3GrRW6gw5zn7XKjvaHD9KCulO-GqEaTJ3mG6PgOOY/edit#gid=0

Screen Shot 2023-02-17 at 1 13 00 PM

Isn't KG2.8.0c only served up in development maturity level ARAX installations?

saramsey avatar Feb 17 '23 21:02 saramsey

I would imagine that arax.ci.transltr.io would be querying the RTX-KG2 service on kg2.ci.transltr.io. As far as I know (but maybe I am out-of-date), that instance is running code from the master branch of the RTX project area, whereas I thought that, in the RTX project area, all the KG2.8.0c stuff was in the kg2-integration branch. Paging @amykglen and @acevedol for a sanity check on what I just wrote.

saramsey avatar Feb 17 '23 21:02 saramsey

Ah, wait, I see that the master branch of RTX has file code/config_dbs.json which is clearly referencing KG2.8.0c:

Screen Shot 2023-02-17 at 1 18 41 PM

saramsey avatar Feb 17 '23 21:02 saramsey

It is my understanding:

> I would imagine that arax.ci.transltr.io would be querying the RTX-KG2 service on kg2.ci.transltr.io.

correct.

> As far as I know (but maybe I am out-of-date), that instance is running code from the master branch of the RTX project area,

correct.

> whereas I thought that, in the RTX project area, all the KG2.8.0c stuff was in the kg2-integration branch.

This is true, but it is also now in master. So anything running master should have KG2.8.0.

> Paging @amykglen and @acevedol for a sanity check on what I just wrote.

edeutsch avatar Feb 17 '23 21:02 edeutsch

Thanks @edeutsch. So, the view from the synonymizer supports your expectation that arax.ci.transltr.io is indeed backed by the RTX-KG2.8.0c KP:

Screen Shot 2023-02-17 at 1 26 30 PM

saramsey avatar Feb 17 '23 21:02 saramsey

> 1. KG2.8.0 contains "gene_associated_with_condition", which apparently it should not

Well, gene associated with condition is in the Biolink 3.0 spec (the version of Biolink against which RTX-KG2.8.0pre was built). Here is the permalink: https://github.com/biolink/biolink-model/blob/1efe2ed5a738f9cb4c32566f8cb7e713f62fa1ab/biolink-model.yaml#L4420

That predicate is also not deprecated:

Screen Shot 2023-02-17 at 1 33 29 PM

saramsey avatar Feb 17 '23 21:02 saramsey

Note, the predicate gene associated with condition is also in Biolink 3.1.2, and also (in that release) not deprecated:

Screen Shot 2023-02-17 at 1 36 18 PM

Also, I note that in NCATSTranslator/Feedback issue 118, they didn't boldface gene_associated_with_condition, so I am not sure that is the predicate they were raising an issue about?

Screen Shot 2023-02-17 at 1 36 53 PM

saramsey avatar Feb 17 '23 21:02 saramsey

Now, as to KCNH3 [increases_activity_of] Gentamycin, that's another story. That edge does indeed seem to be coming from "RTX-KG2",

Screen Shot 2023-02-17 at 1 38 23 PM

saramsey avatar Feb 17 '23 21:02 saramsey

ah, sorry, my error, I mis-inferred which predicate they were unhappy about

edeutsch avatar Feb 17 '23 21:02 edeutsch

But RTX-KG2.8.0pre doesn't have that predicate!

Screen Shot 2023-02-17 at 1 40 30 PM

And yes, in the above query, I am talking to RTX-KG2.8.0pre:

Screen Shot 2023-02-17 at 1 41 06 PM

saramsey avatar Feb 17 '23 21:02 saramsey

So, my friends, we have a bit of a mystery on our hands. We have an ARAX result-set that supposedly comes from a query being executed via ui.ci.transltr.io,

https://arax.ncats.io/?r=cccb6699-29f4-492b-9733-ab77bd1a8261

that pulls up a collection of edges that includes a Biolink 2.X.X-era predicate (i.e., biolink:increases_activity_of). Which seems to imply that somewhere under the hood, that query is being serviced by an RTX-KG2 KP that is backed by KG2.7.6c and not KG2.8.0c. But how?
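One way to check a saved response for such predicates is sketched below. It assumes the response follows the usual TRAPI layout, with edges stored as a dict under `message.knowledge_graph.edges` and each edge carrying a `predicate` string; the function name and the contents of `LEGACY_PREDICATES` are illustrative, seeded only with the predicate named in this thread.

```python
# Predicates identified in this thread as Biolink 2.X.X-era; extend as needed.
LEGACY_PREDICATES = {"biolink:increases_activity_of"}


def find_legacy_edges(response: dict, legacy=LEGACY_PREDICATES) -> list:
    """Return sorted (edge_id, predicate) pairs whose predicate is in `legacy`.

    Assumes a TRAPI-style response: message -> knowledge_graph -> edges (dict).
    """
    message = response.get("message") or {}
    knowledge_graph = message.get("knowledge_graph") or {}
    edges = knowledge_graph.get("edges") or {}
    return sorted(
        (edge_id, edge.get("predicate"))
        for edge_id, edge in edges.items()
        if edge.get("predicate") in legacy
    )
```

Running this over both the cached response and a fresh response from arax.ci.transltr.io would show whether the old predicate appears only in the cached one.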

saramsey avatar Feb 17 '23 21:02 saramsey

Here is the JSON query:

{
  "edges": {
    "N1": {
      "attribute_constraints": [],
      "object": "sn",
      "predicates": [
        "biolink:has_normalized_google_distance_with"
      ],
      "qualifier_constraints": [],
      "subject": "on"
    },
    "creative_DTD_qedge_0": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "creative_DTD_qnode_0",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "sn"
    },
    "creative_DTD_qedge_1": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "creative_DTD_qnode_1",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "creative_DTD_qnode_0"
    },
    "creative_DTD_qedge_2": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "on",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "creative_DTD_qnode_1"
    },
    "t_edge": {
      "attribute_constraints": [],
      "knowledge_type": "inferred",
      "object": "on",
      "predicates": [
        "biolink:treats"
      ],
      "qualifier_constraints": [],
      "subject": "sn"
    }
  },
  "nodes": {
    "creative_DTD_qnode_0": {
      "constraints": [],
      "is_set": true,
      "option_group_id": "creative_DTD_option_group_0"
    },
    "creative_DTD_qnode_1": {
      "constraints": [],
      "is_set": true,
      "option_group_id": "creative_DTD_option_group_0"
    },
    "on": {
      "categories": [
        "biolink:Disease"
      ],
      "constraints": [],
      "ids": [
        "MONDO:0007972"
      ],
      "is_set": false
    },
    "sn": {
      "categories": [
        "biolink:NamedThing"
      ],
      "constraints": [],
      "is_set": false
    }
  }
}

saramsey avatar Feb 17 '23 21:02 saramsey

I am posting this DSL query to arax.ci.transltr.io right now, to see what we get:

add_qnode(ids=CHEMBL.COMPOUND:CHEMBL643, key=n0)
add_qnode(categories=biolink:Protein, key=n1)
add_qedge(subject=n0, object=n1, key=e0)
expand(kp=infores:rtx-kg2)
resultify()
filter_results(action=limit_number_of_results, max_results=100)

I am hoping to pull up the CHEMBL.COMPOUND:CHEMBL643--UniProtKB:Q9ULD8 edge so I can examine the predicate. Fingers crossed....

saramsey avatar Feb 17 '23 22:02 saramsey

So, here is the edge in question. Note the Biolink-3.0-compatible predicate:

Screen Shot 2023-02-17 at 2 10 25 PM

saramsey avatar Feb 17 '23 22:02 saramsey

And if we click on the edge, we see:

Screen Shot 2023-02-17 at 2 11 01 PM

saramsey avatar Feb 17 '23 22:02 saramsey

This latest evidence motivates me to ask, what exactly tells us that ui.ci.transltr.io is querying arax.ci.transltr.io specifically (and not some other ARAX instance) via the ARS?

saramsey avatar Feb 17 '23 22:02 saramsey

aha, wait a sec. The query_graph you show above has lots of "creative_DTD" stuff in it (that I don't fully understand).. what if this edge is coming out of DTD results? and not from KG2.8.0c itself?

edeutsch avatar Feb 17 '23 22:02 edeutsch

Would DTD be giving an edge with 'biolink:increases_activity_of' as the predicate?

saramsey avatar Feb 17 '23 22:02 saramsey

Wouldn't DTD leave some trace in the edge attributes (like EPC type stuff?), that it was a predicted edge?

saramsey avatar Feb 17 '23 22:02 saramsey

I don't know. Perhaps my idea is preposterous. But sometimes that's all I've got. Quite often, actually.. I think we may need @amykglen and @chunyuma to come to the rescue..

edeutsch avatar Feb 17 '23 23:02 edeutsch

So, I have confirmed that in KG2.7.6c, the old (Biolink-2.X.X-era) predicate is there:

Screen Shot 2023-02-17 at 3 36 57 PM

saramsey avatar Feb 17 '23 23:02 saramsey

At this point, I am fairly confident that, somehow, the cached result posted in NCATSTranslator/Feedback issue 118 is showing an edge from RTX-KG2.7.6c.

saramsey avatar Feb 17 '23 23:02 saramsey

so I think because the original query was a knowledge_type: inferred query, I don't think RTX-KG2 is actually being queried directly as a KP (at least in the usual sense).
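As a rough illustration, here is a sketch that picks out the query edges marked as inferred. It assumes the query graph has the same shape as the JSON posted earlier in this thread (an `edges` dict keyed by qedge name); the helper name is hypothetical and the example graph is a trimmed-down copy of that query.

```python
def inferred_qedges(query_graph: dict) -> list:
    """Return the keys of query edges explicitly marked knowledge_type 'inferred'."""
    edges = query_graph.get("edges") or {}
    return sorted(
        key for key, qedge in edges.items()
        if qedge.get("knowledge_type") == "inferred"
    )


# Trimmed-down copy of the query graph posted earlier in this thread.
example_query_graph = {
    "edges": {
        "N1": {
            "subject": "on",
            "object": "sn",
            "predicates": ["biolink:has_normalized_google_distance_with"],
        },
        "t_edge": {
            "subject": "sn",
            "object": "on",
            "predicates": ["biolink:treats"],
            "knowledge_type": "inferred",
        },
    },
    "nodes": {
        "on": {"ids": ["MONDO:0007972"], "categories": ["biolink:Disease"]},
        "sn": {"categories": ["biolink:NamedThing"]},
    },
}

print(inferred_qedges(example_query_graph))  # ['t_edge']
```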

I think it's XDTD that added all those KG2 edges, as @edeutsch suggested - and I don't know the details of how XDTD does that. not sure if it queries KG2 at runtime? but certainly those edges are from an older KG2 version. @chunyuma or @dkoslicki would know the details about where those edges are coming from.

of note, the edges added by XDTD seem to be lacking attributes (except for one attribute each, indicating they came from KG2); that probably isn't ideal. at the very least we should be 'tagging' those edges to make it clear they were added by XDTD, I'd think?
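To list the edges that look like this, something along these lines might work. It assumes TRAPI-style edges with an optional `attributes` list; the function name and the default threshold of one attribute are just illustrative choices based on the observation above.

```python
def sparse_edges(response: dict, max_attributes: int = 1) -> list:
    """Return sorted IDs of knowledge-graph edges with at most `max_attributes` attributes.

    Assumes a TRAPI-style response: message -> knowledge_graph -> edges (dict),
    where each edge may carry an "attributes" list (possibly missing or null).
    """
    message = response.get("message") or {}
    knowledge_graph = message.get("knowledge_graph") or {}
    edges = knowledge_graph.get("edges") or {}
    return sorted(
        edge_id
        for edge_id, edge in edges.items()
        if len(edge.get("attributes") or []) <= max_attributes
    )
```

A low attribute count is only a hint, not proof, that an edge was added by XDTD, which is part of why explicit tagging would be better.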

amykglen avatar Feb 18 '23 01:02 amykglen

Thank you, @amykglen. Agreed on all points.

saramsey avatar Feb 18 '23 06:02 saramsey

Hi @chunyuma when you have a moment, could you please weigh in on this? We want to know if the DTD module could be adding edges to the query-specific KG, with predicates like biolink:increases_activity_of.

saramsey avatar Feb 21 '23 17:02 saramsey

Sorry for the delay @saramsey and @amykglen. The xDTD model does not query KG2 at run-time. This is likely a problem caused by training the xDTD model on a previous version of KG2, whose predicates have since changed. @chunyuma said he can look into how to address that. Any idea what the priority level is for this?

dkoslicki avatar Feb 21 '23 18:02 dkoslicki