RTX icon indicating copy to clipboard operation
RTX copied to clipboard

Can we make the test suite faster?

Open edeutsch opened this issue 2 years ago • 13 comments

it seems that the current the test suite is really quite slow. It seems to take an hour to complete even WITHOUT the --runslow option?

============================================== 116 passed, 118 skipped, 66517 warnings in 3798.36s (1:03:18) ==============================================

I think this make me and others less keen to run tests at all because it just takes so long. But this is debatable How long is "reasonable" for the main test suite to run? 10 minutes? What's the solution? skipping more?

Crazy idea: is it possible that some recent development that we've done (e.g. autodiscovery of resources) could be slowing down each and every query? Maybe some better caching would speed everything up tremendously?

I have no solutions to offer, but maybe we should discuss at the next AHM and set someone to do some profiling to determine where the bottlenecks are?

edeutsch avatar Jan 11 '23 20:01 edeutsch

I agree it seems like some kind of systemic slowness was introduced in the last year, which certainly could be related to our autodiscovery of KPs, and/or who knows what else. perhaps also DTD is contributing some to slowness? #1932

10 minutes sounds like a great target for a core test suite.

to get us started here are the 30 slowest 'fast' tests in our pytest suite (run on my laptop):

11244.74s call     test_ARAX_json_queries.py::test_drug_disease_query
2691.39s call     test_ARAX_standup_queries.py::test_diseases_treated_by_drug_issue_20
366.90s call     test_ARAX_translate.py::test_overlay_after_lookup
303.18s call     test_ARAX_translate.py::test_lookup
154.62s call     test_ARAX_workflows.py::test_exclude
120.39s call     test_ARAX_translate.py::test_score
108.21s call     test_ARAX_connect.py::test_connect_acetaminophen_to_ptgs1
105.86s call     test_ARAX_standup_queries.py::test_chemical_substances_that_down_regulate_STK11_issue_28
72.34s call     test_ARAX_standup_queries.py::test_chemicals_to_gene_issue_10
70.11s call     test_ARAX_standup_queries.py::test_named_thing_associated_with_acrocynaosis_issue_12
68.35s call     test_ARAX_translate.py::test_fill_success
64.67s call     test_ARAX_expand.py::test_edge_constraints
55.15s call     test_ARAX_standup_queries.py::test_gene_to_pathway_issue_9
51.37s call     test_ARAX_workflows.py::test_FET_ranking_2
45.55s call     test_ARAX_expand.py::test_many_kp_query
44.29s call     test_ARAX_translate.py::test_fill_error
41.88s call     test_ARAX_json_queries.py::test_ngd_added
41.59s call     test_ARAX_overlay.py::test_FET_ex1
41.03s call     test_ARAX_overlay.py::test_jaccard
39.49s call     test_ARAX_filter_kg.py::test_edge_key_removal
38.74s call     test_ARAX_expand.py::test_exclude_edge_no_results
37.17s call     test_ARAX_query.py::test_query_by_query_graph_2
37.07s call     test_ARAX_json_queries.py::test_query_by_query_graph_2
37.00s call     test_ARAX_expand.py::test_option_group_query_one_hop
36.53s call     test_ARAX_expand.py::test_xdtd_different_predicates
36.48s call     test_ARAX_translate.py::test_bind
30.22s call     test_ARAX_resultify.py::test_node_binding_query_id_one_hop_single_input_curie
30.05s call     test_ARAX_translate.py::test_complete_results
29.60s call     test_ARAX_filter_kg.py::test_warnings
29.57s call     test_ARAX_translate.py::test_filter_results_top_n

the longest test took 3 hours apparently! and the second longest was 45 minutes. I think those two tests both took a long time on the DTD step. I think even longer than before we rolled out the new DTD database yesterday evening?

amykglen avatar Jan 13 '23 06:01 amykglen

of note, it looks like the test suite currently takes too long for our automated testing system to complete (it times out), so our build is showing as failing in github

amykglen avatar Jan 13 '23 06:01 amykglen

another idea: instead of performing 50 from-scratch queries and asserting one or two things in each, might it be possible to perform only a handful of from-scratch queries and then test/assert a lot of things (very quickly) about those very few results?

edeutsch avatar Jan 13 '23 06:01 edeutsch

yeah, great idea, I was thinking the same. think we could drastically reduce our number of tests by doing that.

amykglen avatar Jan 13 '23 16:01 amykglen

hey @chunyuma @dkoslicki - just a heads up that I just marked two pytests that involve DTD as 'slow' (see above commit) in an attempt to try to stop cicd.rtx.ai from timing out (and thus stop showing our build as failing). apologies for any trouble

amykglen avatar Jan 13 '23 17:01 amykglen

Hey @amykglen, let's see if increasing the DTD expand speed in issue #1932 can make these two pytests run faster.

chunyuma avatar Jan 14 '23 23:01 chunyuma

Here's a play-by-play of our acetaminophen query, which takes 57 seconds:

2023-01-18T19:49:23.180082 DEBUG: in asynchronous_query
2023-01-18T19:49:23.180170 DEBUG: In query_return_stream
2023-01-18T19:49:23.180458 INFO: ARAX Query launching on incoming Query
2023-01-18T19:49:23.180507 DEBUG: RTXConfiguration says maturity=development, current_branch=master, is_itrb_instance=False, arax_version=1.3.1, trapi_version=1.3.0
2023-01-18T19:49:23.180549 INFO: Creating an empty template TRAPI Response
2023-01-18T19:49:24.252743 INFO: Examine input Query for needed information for dispatch
2023-01-18T19:49:24.252807 INFO: Validating the input query graph
2023-01-18T19:49:24.252840 DEBUG: Deserializing message
2023-01-18T19:49:24.253025 INFO: Found input query_graph. Interpreting it and generating ARAXi processing plan to answer it
2023-01-18T19:49:24.303305 DEBUG: Assessing the QueryGraph for basic information
2023-01-18T19:49:24.303346 DEBUG: Found 2 nodes and 1 edges
2023-01-18T19:49:26.314943 DEBUG: canonical_curies={'CHEMBL.COMPOUND:CHEMBL112': {'preferred_curie': 'CHEMBL.COMPOUND:CHEMBL112', 'preferred_name': 'ACETAMINOPHEN', 'preferred_category': 'biolink:SmallMolecule', 'expanded_categories': {'biolink:SmallMolecule': True}, 'all_categories': {'biolink:ChemicalEntity': 14, 'biolink:SmallMolecule': 4, 'biolink:MolecularEntity': 1, 'biolink:Procedure': 1, 'biolink:Drug': 3, 'biolink:BiologicalEntity': 3}}}, curie=CHEMBL.COMPOUND:CHEMBL112

why does this take 2 whole seconds? Node Normalizer?

2023-01-18T19:49:26.315094 DEBUG: The QueryGraph reference template is: n00(ids)-e00()-n01(categories=biolink:Protein)
2023-01-18T19:49:26.315394 INFO: Found input processing plan. Sending to the ProcessingPlanExecutor
2023-01-18T19:49:26.315437 DEBUG: Entering execute_processing_plan
2023-01-18T19:49:26.969301 DEBUG: A single Message is ready and in hand
2023-01-18T19:49:26.969360 DEBUG: Found actions
2023-01-18T19:49:26.969402 INFO: Parsing input actions list
2023-01-18T19:49:26.969425 DEBUG: Parsing action: expand()
2023-01-18T19:49:26.969520 DEBUG: Parsing action: overlay(action=compute_ngd, virtual_relation_label=N1, subject_qnode_key=n00, object_qnode_key=n01)
2023-01-18T19:49:26.969573 DEBUG: Parsing action: resultify()
2023-01-18T19:49:26.969631 DEBUG: Parsing action: filter_results(action=limit_number_of_results, max_results=500)

what's going on here for 1 second?

2023-01-18T19:49:28.012946 INFO: Processing action 'expand' with parameters {'': 'true'}
2023-01-18T19:49:28.639556 DEBUG: Loading record of KP timeouts
2023-01-18T19:49:28.639911 DEBUG: Loading meta map (already exists and isn't due for a refresh)
2023-01-18T19:49:29.190426 DEBUG: Applying Expand to Message with parameters {'kp': None, 'edge_key': ['e00'], 'node_key': [], 'prune_threshold': 50, 'kp_timeout': 30, 'return_minimal_metadata': False, 'DTD_threshold': 0.8, 'DTD_slow_mode': False}
2023-01-18T19:49:29.714799 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:30.345899 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:30.346098 DEBUG: Inferred category for qnode n00 is ['biolink:ChemicalEntity']
2023-01-18T19:49:30.346152 DEBUG: Making sure QG only uses canonical predicates
2023-01-18T19:49:30.346603 DEBUG: Query graph for this Expand() call is: {'nodes': {'n00': {'ids': ['CHEMBL.COMPOUND:CHEMBL112'], 'categories': ['biolink:ChemicalEntity'], 'is_set': False, 'constraints': [], 'option_group_id': None}, 'n01': {'ids': None, 'categories': ['biolink:Gene', 'biolink:Protein'], 'is_set': False, 'constraints': [], 'option_group_id': None}}, 'edges': {'e00': {'knowledge_type': None, 'predicates': ['biolink:physically_interacts_with'], 'subject': 'n00', 'object': 'n01', 'attribute_constraints': [], 'qualifier_constraints': [], 'exclude': None, 'option_group_id': None}}}
2023-01-18T19:49:30.346711 DEBUG: Expanding qedge e00
2023-01-18T19:49:30.347064 DEBUG: Modified QG for this qedge is (n00:['biolink:ChemicalEntity'] ['CHEMBL.COMPOUND:CHEMBL112'])-['biolink:physically_interacts_with']-(n01:['biolink:Gene', 'biolink:Protein'])
2023-01-18T19:49:30.347094 DEBUG: For e00, pre-prune threshold is 50
2023-01-18T19:49:30.347121 DEBUG: Selecting KPs to use for qedge e00
2023-01-18T19:49:30.347270 DEBUG: selecting from 33 kps
2023-01-18T19:49:30.352955 DEBUG: Skipped infores:sri-ontology: KP does not have a development TRAPI 1.3 endpoint
2023-01-18T19:49:30.352990 DEBUG: Skipped infores:connections-hypothesis: KP does not have a development TRAPI 1.3 endpoint
2023-01-18T19:49:30.353025 INFO: The KPs Expand decided to answer e00 with are: {'infores:rtx-kg2', 'infores:molepro', 'infores:automat-robokop', 'infores:spoke', 'infores:automat-pharos', 'infores:automat-drug-central', 'infores:automat-hetio', 'infores:automat-gtopdb', 'infores:cam-kp', 'infores:service-provider-trapi'}
2023-01-18T19:49:30.353056 DEBUG: Will use asyncio to run KP queries concurrently

2023-01-18T19:49:30.353673 INFO: Expanding qedge e00 using infores:rtx-kg2
2023-01-18T19:49:32.020051 DEBUG: infores:rtx-kg2: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:32.548372 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:33.075758 DEBUG: Got response back from NodeSynonymizer

Takes 0.5 second

2023-01-18T19:49:33.076039 DEBUG: infores:rtx-kg2: Sending query to infores:rtx-kg2 API (https://arax.ncats.io/beta/api/rtxkg2/v1.3)
2023-01-18T19:49:33.077176 INFO: Expanding qedge e00 using infores:molepro
2023-01-18T19:49:33.077722 DEBUG: infores:molepro: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:33.595353 DEBUG: infores:molepro: One or more n00 curies use a prefix infores:molepro doesn't support; will convert these
2023-01-18T19:49:33.595415 DEBUG: infores:molepro: Converting curies in the QG to kinds that infores:molepro can answer
2023-01-18T19:49:34.230183 DEBUG: infores:molepro: Prefixes infores:molepro supports for categories ['biolink:ChemicalEntity'] (and descendants) are: {'UNII', 'NCIT', 'CHEBI', 'CHEMBL.TARGET', 'CHEMBL', 'RXCUI', 'INCHIKEY', 'DRUGSTORE', 'GTOPDB', 'INCHI', 'MESH', 'CHEMBANK', 'DRUGBANK', 'LIPIDMAPS', 'PDB.LIGAND', 'PUBCHEM.COMPOUND', 'DRUGCENTRAL', 'BIGG.METABOLITE', 'CID', 'XCHEM', 'CAS', 'KEGG.COMPOUND', 'BINDINGDB', 'SMILES', 'HMDB', 'CANSAR', 'MYCHEM_INFO'}
2023-01-18T19:49:35.275015 DEBUG: infores:molepro: Converted n00's 1 curies to a list of 1 curies tailored for infores:molepro
2023-01-18T19:49:35.275196 DEBUG: infores:molepro: Sending query to infores:molepro API (https://translator.broadinstitute.org/molepro/trapi/v1.3)
2023-01-18T19:49:35.275819 INFO: Expanding qedge e00 using infores:automat-robokop
2023-01-18T19:49:35.276177 DEBUG: infores:automat-robokop: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:35.918944 DEBUG: infores:automat-robokop: All n00 curies use prefix(es) infores:automat-robokop supports; no conversion necessary
2023-01-18T19:49:35.919153 DEBUG: infores:automat-robokop: Sending query to infores:automat-robokop API (https://automat.renci.org/robokopkg/1.3)
2023-01-18T19:49:35.919767 INFO: Expanding qedge e00 using infores:spoke
2023-01-18T19:49:35.920129 DEBUG: infores:spoke: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:36.439214 DEBUG: infores:spoke: All n00 curies use prefix(es) infores:spoke supports; no conversion necessary
2023-01-18T19:49:36.439421 DEBUG: infores:spoke: Sending query to infores:spoke API (https://spokekp.healthdatascience.cloud/api/v1.3)
2023-01-18T19:49:36.440032 INFO: Expanding qedge e00 using infores:automat-pharos
2023-01-18T19:49:36.440392 DEBUG: infores:automat-pharos: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:36.960583 DEBUG: infores:automat-pharos: All n00 curies use prefix(es) infores:automat-pharos supports; no conversion necessary
2023-01-18T19:49:36.960932 DEBUG: infores:automat-pharos: Sending query to infores:automat-pharos API (https://automat.renci.org/pharos/1.3)
2023-01-18T19:49:36.961758 INFO: Expanding qedge e00 using infores:automat-drug-central
2023-01-18T19:49:36.962120 DEBUG: infores:automat-drug-central: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:37.596333 DEBUG: infores:automat-drug-central: All n00 curies use prefix(es) infores:automat-drug-central supports; no conversion necessary
2023-01-18T19:49:37.596543 DEBUG: infores:automat-drug-central: Sending query to infores:automat-drug-central API (https://automat.renci.org/drugcentral/1.3)
2023-01-18T19:49:37.597153 INFO: Expanding qedge e00 using infores:automat-hetio
2023-01-18T19:49:37.597536 DEBUG: infores:automat-hetio: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:38.113772 DEBUG: infores:automat-hetio: All n00 curies use prefix(es) infores:automat-hetio supports; no conversion necessary
2023-01-18T19:49:38.113977 DEBUG: infores:automat-hetio: Sending query to infores:automat-hetio API (https://automat.renci.org/hetio/1.3)
2023-01-18T19:49:38.114581 INFO: Expanding qedge e00 using infores:automat-gtopdb
2023-01-18T19:49:38.114937 DEBUG: infores:automat-gtopdb: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:38.635391 DEBUG: infores:automat-gtopdb: All n00 curies use prefix(es) infores:automat-gtopdb supports; no conversion necessary
2023-01-18T19:49:38.635598 DEBUG: infores:automat-gtopdb: Sending query to infores:automat-gtopdb API (https://automat.renci.org/gtopdb/1.3)
2023-01-18T19:49:38.636212 INFO: Expanding qedge e00 using infores:cam-kp
2023-01-18T19:49:38.636571 DEBUG: infores:cam-kp: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:39.155499 DEBUG: infores:cam-kp: One or more n00 curies use a prefix infores:cam-kp doesn't support; will convert these
2023-01-18T19:49:39.155561 DEBUG: infores:cam-kp: Converting curies in the QG to kinds that infores:cam-kp can answer
2023-01-18T19:49:39.786303 DEBUG: infores:cam-kp: Prefixes infores:cam-kp supports for categories ['biolink:ChemicalEntity'] (and descendants) are: {'UNIPROTKB', 'ZFIN', 'NCBIGENE', 'RGD', 'CHEBI', 'SIO', 'FB', 'SO', 'HGNC', 'PR', 'TAIR.LOCUS', 'OBI', 'OBO', 'SGD', 'GENO', 'DICTYBASE.GENE', 'FOODON', 'MGI', 'REACTO', 'XENBASE', 'POMBASE', 'ECTO'}
2023-01-18T19:49:40.825895 DEBUG: infores:cam-kp: Converted n00's 1 curies to a list of 1 curies tailored for infores:cam-kp
2023-01-18T19:49:40.826088 DEBUG: infores:cam-kp: Sending query to infores:cam-kp API (https://cam-kp-api.renci.org/1.3.0)
2023-01-18T19:49:40.826702 INFO: Expanding qedge e00 using infores:service-provider-trapi
2023-01-18T19:49:40.827062 DEBUG: infores:service-provider-trapi: Qnodes with an implied parent query ID are: {'n00': 'CHEMBL.COMPOUND:CHEMBL112'}
2023-01-18T19:49:41.468169 DEBUG: infores:service-provider-trapi: All n00 curies use prefix(es) infores:service-provider-trapi supports; no conversion necessary
2023-01-18T19:49:41.468378 DEBUG: infores:service-provider-trapi: Sending query to infores:service-provider-trapi API (https://api.bte.ncats.io/v1/team/Service%20Provider)

it took 8 seconds to launch all the queries to the various sources?

2023-01-18T19:49:41.493588 WARNING: infores:spoke: No 'message' was included in the response from infores:spoke. Response was: { "description": "SPOKE KP could not match this combination of predicates and (if present) qualifier constraints; returning empty message...", "message": {}, "status": "Query unprocessable" }
2023-01-18T19:49:41.493781 INFO: infores:spoke: Query for edge e00 completed (no answers)
2023-01-18T19:49:42.014433 DEBUG: infores:spoke: Deduplicating nodes
2023-01-18T19:49:42.014515 DEBUG: infores:spoke: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:42.340242 WARNING: infores:automat-robokop: Returned HTTP error 404 after 6 seconds. Query sent to KP was: {'message': {'query_graph': {'nodes': {'n00': {'ids': ['CHEMBL.COMPOUND:CHEMBL112'], 'categories': ['biolink:ChemicalEntity'], 'is_set': False}, 'n01': {'categories': ['biolink:Gene', 'biolink:Protein'], 'is_set': True}}, 'edges': {'e00': {'predicates': ['biolink:physically_interacts_with'], 'subject': 'n00', 'object': 'n01'}}}}}
2023-01-18T19:49:42.340467 INFO: infores:automat-robokop: Query for edge e00 completed (no answers)
2023-01-18T19:49:42.865029 DEBUG: infores:automat-robokop: Deduplicating nodes
2023-01-18T19:49:42.865106 DEBUG: infores:automat-robokop: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:42.867965 DEBUG: infores:automat-drug-central: No 'results' were returned.
2023-01-18T19:49:42.868042 INFO: infores:automat-drug-central: Query for edge e00 completed (no answers)
2023-01-18T19:49:43.391793 DEBUG: infores:automat-drug-central: Deduplicating nodes
2023-01-18T19:49:43.391871 DEBUG: infores:automat-drug-central: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:43.391947 DEBUG: infores:automat-gtopdb: No 'results' were returned.
2023-01-18T19:49:43.392016 INFO: infores:automat-gtopdb: Query for edge e00 completed (no answers)
2023-01-18T19:49:44.027063 DEBUG: infores:automat-gtopdb: Deduplicating nodes
2023-01-18T19:49:44.027141 DEBUG: infores:automat-gtopdb: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:44.027215 DEBUG: infores:automat-hetio: No 'results' were returned.
2023-01-18T19:49:44.027284 INFO: infores:automat-hetio: Query for edge e00 completed (no answers)
2023-01-18T19:49:44.552105 DEBUG: infores:automat-hetio: Deduplicating nodes
2023-01-18T19:49:44.552185 DEBUG: infores:automat-hetio: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:44.552263 DEBUG: infores:automat-pharos: No 'results' were returned.
2023-01-18T19:49:44.552331 INFO: infores:automat-pharos: Query for edge e00 completed (no answers)
2023-01-18T19:49:45.073755 DEBUG: infores:automat-pharos: Deduplicating nodes
2023-01-18T19:49:45.073831 DEBUG: infores:automat-pharos: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:45.073937 DEBUG: infores:cam-kp: No 'results' were returned.
2023-01-18T19:49:45.074003 INFO: infores:cam-kp: Query for edge e00 completed (no answers)
2023-01-18T19:49:45.715882 DEBUG: infores:cam-kp: Deduplicating nodes
2023-01-18T19:49:45.715959 DEBUG: infores:cam-kp: After deduplication, answer KG counts are: no answers
2023-01-18T19:49:45.716033 DEBUG: infores:molepro: Got results from infores:molepro.
2023-01-18T19:49:46.242337 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:46.767057 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:47.295931 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 0 curies
2023-01-18T19:49:47.947227 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:48.474984 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:48.999118 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:48.999297 DEBUG: infores:molepro: Added 1 subclass_of edges to the KG based on query ID mappings infores:molepro returned
2023-01-18T19:49:48.999478 INFO: infores:molepro: Query for edge e00 completed (e00: 13, n00: 2, n01: 13, subclass:n00--n00: 1)
2023-01-18T19:49:49.524365 DEBUG: infores:molepro: Deduplicating nodes
2023-01-18T19:49:49.524435 DEBUG: infores:molepro: Getting preferred curies for n01 nodes returned in this step
2023-01-18T19:49:50.162391 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 13 curies
2023-01-18T19:49:50.689584 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:50.689699 WARNING: NodeSynonymizer did not recognize: {'UniProtKB:F8WCM5-1'}
2023-01-18T19:49:50.689837 DEBUG: infores:molepro: Getting preferred curies for n00 nodes returned in this step
2023-01-18T19:49:51.208711 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 2 curies
2023-01-18T19:49:51.852283 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:51.852468 DEBUG: infores:molepro: After deduplication, answer KG counts are: e00: 13, n00: 1, n01: 13, subclass:n00--n00: 1
2023-01-18T19:49:51.852511 DEBUG: infores:molepro: Removing any self-edges from the answer KG
2023-01-18T19:49:51.852570 DEBUG: infores:molepro: After removing self-edges, answer KG counts are: e00: 13, n00: 1, n01: 13, subclass:n00--n00: 0
2023-01-18T19:49:51.854921 DEBUG: infores:service-provider-trapi: Got results from infores:service-provider-trapi.
2023-01-18T19:49:52.383882 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:52.909131 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:53.433240 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 0 curies
2023-01-18T19:49:54.071974 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:54.594704 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:49:55.119158 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:55.119326 DEBUG: infores:service-provider-trapi: Added 1 subclass_of edges to the KG based on query ID mappings infores:service-provider-trapi returned
2023-01-18T19:49:55.119787 INFO: infores:service-provider-trapi: Query for edge e00 completed (e00: 27, n00: 2, n01: 27, subclass:n00--n00: 1)
2023-01-18T19:49:55.756701 DEBUG: infores:service-provider-trapi: Deduplicating nodes
2023-01-18T19:49:55.756772 DEBUG: infores:service-provider-trapi: Getting preferred curies for n01 nodes returned in this step
2023-01-18T19:49:56.274895 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 27 curies
2023-01-18T19:49:56.796560 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:56.796762 DEBUG: infores:service-provider-trapi: Getting preferred curies for n00 nodes returned in this step
2023-01-18T19:49:57.315294 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 2 curies
2023-01-18T19:49:57.949646 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:57.949861 DEBUG: infores:service-provider-trapi: After deduplication, answer KG counts are: e00: 27, n00: 1, n01: 27, subclass:n00--n00: 1
2023-01-18T19:49:57.949902 DEBUG: infores:service-provider-trapi: Removing any self-edges from the answer KG
2023-01-18T19:49:57.949975 DEBUG: infores:service-provider-trapi: After removing self-edges, answer KG counts are: e00: 27, n00: 1, n01: 27, subclass:n00--n00: 0
2023-01-18T19:49:57.951313 DEBUG: infores:rtx-kg2: Got results from infores:rtx-kg2.

17 seconds to collect the rest of the results from various KPs

2023-01-18T19:49:58.477222 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 0 curies
2023-01-18T19:49:59.000512 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:49:59.640601 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 0 curies
2023-01-18T19:50:00.168312 DEBUG: Got response back from NodeSynonymizer

burned 1.5 seconds asking NodeSynonymizer for 0 curies?

2023-01-18T19:50:00.168553 INFO: infores:rtx-kg2: Query for edge e00 completed (e00: 17, n00: 1, n01: 10)
2023-01-18T19:50:00.168589 DEBUG: infores:rtx-kg2: Removing any self-edges from the answer KG
2023-01-18T19:50:00.168633 DEBUG: infores:rtx-kg2: After removing self-edges, answer KG counts are: e00: 17, n00: 1, n01: 10
2023-01-18T19:50:00.169072 DEBUG: Got answers from all KPs; merging them into one KG

2023-01-18T19:50:00.169232 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:00.691832 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:01.220333 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:01.220526 DEBUG: Removed 0 edges fulfilling e00 from the KG because they were orphaned
2023-01-18T19:50:01.220562 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:01.856687 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:02.381339 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:02.381576 DEBUG: Removed 0 edges fulfilling e00 from the KG because they were orphaned
2023-01-18T19:50:02.381654 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:02.904487 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:03.544881 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:03.545017 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:04.069819 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:04.596437 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:04.596573 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:05.237587 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:05.760863 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:05.760999 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:06.289035 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:06.818237 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:06.818374 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:07.454586 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:07.979310 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:07.979447 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:08.502247 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:09.145131 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:09.145265 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:09.671377 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:10.197670 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:10.197812 DEBUG: Merging answer into Message.KnowledgeGraph
2023-01-18T19:50:10.722663 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:11.363602 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:11.363835 DEBUG: Removed 0 edges fulfilling e00 from the KG because they were orphaned
2023-01-18T19:50:11.363879 DEBUG: After merging KPs' answers, total KG counts are: e00: 57, n00: 1, n01: 50

11 seconds to merging the answers? Is this mostly because of the NodeSynonymizer??

2023-01-18T19:50:11.363908 DEBUG: Handling any knowledge source constraints
2023-01-18T19:50:11.363933 DEBUG: KP allowlist is None, denylist is set()
2023-01-18T19:50:11.364365 DEBUG: Removing 0 edges because they do not fulfill knowledge_source constraint
2023-01-18T19:50:11.364415 DEBUG: Pruning any paths that are now dead ends (with help of Resultify)
2023-01-18T19:50:11.364872 DEBUG: Calling Resultify from Expand for pruning
2023-01-18T19:50:11.367546 DEBUG: After removing dead-end paths, KG counts are: e00: 57, n00: 1, n01: 50
2023-01-18T19:50:11.367715 DEBUG: Decorating nodes with metadata from KG2c
2023-01-18T19:50:11.891611 DEBUG: Looking up corresponding KG2c nodes in sqlite
2023-01-18T19:50:11.892804 DEBUG: Adding attributes to nodes in the KG
2023-01-18T19:50:11.913869 DEBUG: Decorating edges with EPC info from KG2c
2023-01-18T19:50:11.914132 DEBUG: Identified 17 edges to decorate
2023-01-18T19:50:12.438156 DEBUG: Looking up EPC edge info in KG2c sqlite
2023-01-18T19:50:12.438613 DEBUG: Got 0 rows back from KG2c sqlite
2023-01-18T19:50:12.438645 DEBUG: Adding attributes to edges in the KG
2023-01-18T19:50:12.438937 DEBUG: Overriding node categories to better align with what's in the QG
2023-01-18T19:50:13.084168 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:13.613658 DEBUG: Got response back from NodeSynonymizer
2023-01-18T19:50:13.613772 DEBUG: No KG nodes found that use a different curie than was asked for in the QG
2023-01-18T19:50:13.613822 INFO: After Expand, the KG has 51 nodes and 57 edges (e00: 57, n00: 1, n01: 50)

2023-01-18T19:50:13.617159 INFO: Processing action 'overlay' with parameters {'action': 'compute_ngd', 'virtual_relation_label': 'N1', 'subject_qnode_key': 'n00', 'object_qnode_key': 'n01'}
2023-01-18T19:50:13.617264 DEBUG: Applying Overlay to Message with parameters {'action': 'compute_ngd', 'virtual_relation_label': 'N1', 'subject_qnode_key': 'n00', 'object_qnode_key': 'n01'}
2023-01-18T19:50:14.665773 DEBUG: Computing NGD
2023-01-18T19:50:14.665823 INFO: Computing the normalized Google distance: weighting edges based on subject/object node co-occurrence frequency in PubMed abstracts
2023-01-18T19:50:14.665867 DEBUG: Narrowing down n00--n01 node pairs to overlay
2023-01-18T19:50:14.673323 DEBUG: Identified 50 node pairs to overlay (with help of resultify)
2023-01-18T19:50:14.673507 DEBUG: Canonicalizing curies of relevant nodes using NodeSynonymizer
2023-01-18T19:50:15.832355 DEBUG: Extracting PMID lists from sqlite database for relevant nodes
2023-01-18T19:50:15.885835 DEBUG: Looping through 50 node pairs and calculating NGD values
2023-01-18T19:50:15.905462 DEBUG: More than 30 publications found for some edges limiting to 30...
2023-01-18T19:50:16.058049 INFO: NGD values successfully added to edges

2.3 seconds for NGD

2023-01-18T19:50:16.058125 DEBUG: Decorating edges with EPC info from KG2c
2023-01-18T19:50:16.058208 DEBUG: Identified 50 edges to decorate
2023-01-18T19:50:16.585175 DEBUG: Looking up EPC edge info in KG2c sqlite
2023-01-18T19:50:16.586185 DEBUG: Got 46 rows back from KG2c sqlite
2023-01-18T19:50:16.586220 DEBUG: Adding attributes to edges in the KG
2023-01-18T19:50:16.592336 DEBUG: Query graph is {'edges': {'N1': {'attribute_constraints': [], 'exclude': None, 'knowledge_type': None, 'object': 'n01', 'option_group_id': None, 'predicates': ['biolink:has_normalized_google_distance_with'], 'qualifier_constraints': [], 'subject': 'n00'}, 'e00': {'attribute_constraints': [], 'exclude': None, 'knowledge_type': None, 'object': 'n01', 'option_group_id': None, 'predicates': ['biolink:physically_interacts_with'], 'qualifier_constraints': [], 'subject': 'n00'}}, 'nodes': {'n00': {'categories': None, 'constraints': [], 'ids': ['CHEMBL.COMPOUND:CHEMBL112'], 'is_set': False, 'option_group_id': None}, 'n01': {'categories': ['biolink:Protein'], 'constraints': [], 'ids': None, 'is_set': False, 'option_group_id': None}}}
2023-01-18T19:50:16.592435 DEBUG: Number of nodes in KG is 51
2023-01-18T19:50:16.592553 DEBUG: Number of nodes in KG by type is Counter({'biolink:Gene': 34, 'biolink:Protein': 16, 'biolink:SmallMolecule': 1})
2023-01-18T19:50:16.592580 DEBUG: Number of edges in KG is 107
2023-01-18T19:50:16.592678 DEBUG: Number of edges in KG by type is Counter({'biolink:physically_interacts_with': 57, 'biolink:has_normalized_google_distance_with': 50})
2023-01-18T19:50:16.592734 DEBUG: Number of edges in KG with attributes is 107
2023-01-18T19:50:16.593127 DEBUG: Number of edges in KG by attribute Counter({None: 396, 'normalized_google_distance': 50, 'publications': 50, 'virtual_relation_label': 50, 'defined_datetime': 50, 'provided_by': 50, 'mode': 18, 'biolink:primary_knowledge_source': 13, 'biolink:aggregator_knowledge_source': 13})
2023-01-18T19:50:16.593159 INFO: Processing action 'resultify' with parameters {'': 'true'}
2023-01-18T19:50:16.593187 DEBUG: Applying Resultifier to Message with parameters {'': 'true'}
2023-01-18T19:50:16.593240 DEBUG: Expanded qedges are {'e00', 'N1'}, expanded qnodes are {'n00', 'n01'}; will resultify only this sub-QG
2023-01-18T19:50:16.593773 DEBUG: Subclass self-qedges are set(); qnodes with a subclass self-qedge are: set()
2023-01-18T19:50:16.593806 DEBUG: Building helper indexes for faster lookup of edges
2023-01-18T19:50:16.594460 DEBUG: Grabbing only required portion of QG
2023-01-18T19:50:16.594501 DEBUG: Required qnodes are {'n00', 'n01'}, required qedges are {'e00', 'N1'}
2023-01-18T19:50:16.594542 INFO: Creating result graphs for required portion of QG
2023-01-18T19:50:16.595066 DEBUG: Constructing result graphs qnode by qnode
2023-01-18T19:50:16.595095 DEBUG: Qnode keys already handled are: set()
2023-01-18T19:50:16.595118 DEBUG: Initiating result graphs with nodes for n00 (is_set=False)
2023-01-18T19:50:16.595136 DEBUG: Starting with a result graph for each n00 node
2023-01-18T19:50:16.595158 DEBUG: Current count of result graphs is 1
2023-01-18T19:50:16.595183 DEBUG: Next qnode chosen is: n01
2023-01-18T19:50:16.595202 DEBUG: Adding a layer to each result graph for qnode n01 (is_set=False)
2023-01-18T19:50:16.597020 DEBUG: Current count of result graphs is 50
2023-01-18T19:50:16.597048 DEBUG: Done assigning nodes to result graphs.
2023-01-18T19:50:16.597065 DEBUG: Adding edges to result graphs
2023-01-18T19:50:16.597535 DEBUG: After pruning out result graphs missing edges, there are 50 result graphs
2023-01-18T19:50:16.597569 DEBUG: Created 50 required result graphs
2023-01-18T19:50:16.597599 DEBUG: Required non-set qnodes are: ['n00', 'n01']
2023-01-18T19:50:16.597739 DEBUG: There are a total of 50 final result graphs
2023-01-18T19:50:16.599712 DEBUG: Loading final result graphs into TRAPI object model
2023-01-18T19:50:16.602229 INFO: Resultify created 50 results
2023-01-18T19:50:16.602330 DEBUG: Cleaning up the KG to remove nodes not used in the results
2023-01-18T19:50:16.602495 INFO: After cleaning, the KG contains 51 nodes and 107 edges

2023-01-18T19:50:16.602521 INFO: Running experimental reranker on results
2023-01-18T19:50:16.602568 DEBUG: Starting to rank results
2023-01-18T19:50:18.625266 INFO: Summary of available edge metrics: {'normalized_google_distance': {'minimum': 0.4792348874522798, 'maximum': 1.003703835795986}}
2023-01-18T19:50:18.659304 DEBUG: Results have been ranked and sorted

2 seconds to rank with "experimental reranker"

2023-01-18T19:50:18.659416 INFO: Processing action 'filter_results' with parameters {'action': 'limit_number_of_results', 'max_results': '500'}
2023-01-18T19:50:18.659447 DEBUG: Before filtering, there are 50 results
2023-01-18T19:50:18.662079 DEBUG: Limiting Number of Results
2023-01-18T19:50:18.662117 INFO: Filtering excess results above max result limit
2023-01-18T19:50:18.662396 INFO: KG successfully pruned to match results
2023-01-18T19:50:18.662422 INFO: Successfully limited the number of results
2023-01-18T19:50:18.662447 DEBUG: Applying Overlay to Message with parameters {'action': 'limit_number_of_results', 'max_results': 500, 'prune_kg': True}
2023-01-18T19:50:18.663183 DEBUG: Query graph is {'edges': {'N1': {'attribute_constraints': [], 'exclude': None, 'knowledge_type': None, 'object': 'n01', 'option_group_id': None, 'predicates': ['biolink:has_normalized_google_distance_with'], 'qualifier_constraints': [], 'subject': 'n00'}, 'e00': {'attribute_constraints': [], 'exclude': None, 'knowledge_type': None, 'object': 'n01', 'option_group_id': None, 'predicates': ['biolink:physically_interacts_with'], 'qualifier_constraints': [], 'subject': 'n00'}}, 'nodes': {'n00': {'categories': None, 'constraints': [], 'ids': ['CHEMBL.COMPOUND:CHEMBL112'], 'is_set': False, 'option_group_id': None}, 'n01': {'categories': ['biolink:Protein'], 'constraints': [], 'ids': None, 'is_set': False, 'option_group_id': None}}}
2023-01-18T19:50:18.663217 DEBUG: Number of nodes in KG is 51
2023-01-18T19:50:18.663306 DEBUG: Number of nodes in KG by type is Counter({'biolink:Gene': 34, 'biolink:Protein': 16, 'biolink:SmallMolecule': 1})
2023-01-18T19:50:18.663329 DEBUG: Number of edges in KG is 107
2023-01-18T19:50:18.663396 DEBUG: Number of edges in KG by type is Counter({'biolink:physically_interacts_with': 57, 'biolink:has_normalized_google_distance_with': 50})
2023-01-18T19:50:18.663450 DEBUG: Number of edges in KG with attributes is 107
2023-01-18T19:50:18.663830 DEBUG: Number of edges in KG by attribute Counter({None: 396, 'normalized_google_distance': 50, 'publications': 50, 'virtual_relation_label': 50, 'defined_datetime': 50, 'provided_by': 50, 'mode': 18, 'biolink:primary_knowledge_source': 13, 'biolink:aggregator_knowledge_source': 13})
2023-01-18T19:50:18.663895 DEBUG: Storing resulting Message
2023-01-18T19:50:18.663918 DEBUG: Writing response record to MySQL
2023-01-18T19:50:19.318669 DEBUG: Writing response JSON to S3 bucket
2023-01-18T19:50:19.594607 DEBUG: Wrote /responses/117003.json in 0.14478141069412231 seconds
2023-01-18T19:50:19.595809 INFO: Result was stored with id 117003. It can be viewed at https://arax.ncats.io/?r=117003
2023-01-18T19:50:19.595848 INFO: Processing is complete and resulted in 50 results.
2023-01-18T19:50:19.595869 INFO: Processing is complete. Transmitting resulting Message back to client.

1 second cleanup

Are there opportunities for streamlining here?

edeutsch avatar Jan 18 '23 20:01 edeutsch

Hi @amykglen, I've significantly reduced the test time of top 3 "slow" tests that you reported above:

3m28.225s call     test_ARAX_json_queries.py::test_drug_disease_query
1m10.596s call     test_ARAX_standup_queries.py::test_diseases_treated_by_drug_issue_20
1m18.035s call     test_ARAX_translate.py::test_overlay_after_lookup

Considering they still need more than 1 minute, I would keep marking them as slow tests in pytest. Please let me know anything else I can help for improving the speed.

chunyuma avatar Jan 19 '23 05:01 chunyuma

So it seems like the category manager may be the cause. When I run a NodeSynonymizer test for get_canonical_curies() I get:

==== Get canonical curies for a set of input curies ============================
WARNING: returned with status 404 while retrieving ancestors for biolink:SmallMolecule
WARNING: returned with status 404 while retrieving ancestors for biolink:Disease
WARNING: returned with status 404 while retrieving ancestors for biolink:PhenotypicFeature
WARNING: returned with status 404 while retrieving ancestors for biolink:AnatomicalEntity
WARNING: returned with status 404 while retrieving ancestors for biolink:Protein
WARNING: returned with status 404 while retrieving ancestors for biolink:Gene
WARNING: returned with status 404 while retrieving ancestors for biolink:SmallMolecule
WARNING: returned with status 404 while retrieving ancestors for biolink:Disease
WARNING: returned with status 404 while retrieving ancestors for biolink:PhenotypicFeature
...
Elapsed time: 5.717957079410553

So it looks like NodeSynonymizer is slow because it wants category information from the CategoryManager and the CategoryManager is thrashing away at an obsolete endpoint?

Is the CategoryManager still an important thing? or has it been replaced by something?

edeutsch avatar Jan 24 '23 05:01 edeutsch

So the problem is definitely that the CategoryManager is struggling and slow. One problem is that we internalize our (ARAX's) Biolink version as 3.0.3. And the RENCI Biolink lookup service only supports these versions:

$ curl -X 'GET' 'https://bl-lookup-sri.renci.org/versions' -H 'accept: application/json'
["v3.1.2","v3.1.1","v2.4.8","v2.4.7","v2.4.6","v2.4.5","v.2.4.4","v2.4.3","v2.4.2","v2.4.1","v2.4.0","v2.3.1","v2.3.0","v2.2.16","2.2.16","v2.2.15","v2.2.14","2.2.13","2.2.12","2.2.11","2.2.10","2.2.9","2.2.8","2.2.7","2.2.6","2.2.5","2.2.4","2.2.3","2.2.2","2.2.1","2.2.0","2.1.0","2.0.2","2.0.1","2.0.0","1.8.2","1.8.1","1.8.0","1.7.0","1.6.1","1.6.0","1.5.0","1.4.0","1.3.9","latest"]

which does not include what we think our Biolink version is. But there are other issues as well.

So I think the CategoryManager is broken.

But I also notice that there seems to be a very nice (and complex) ARAX BiolinkHelper module that I don't know at all.

Can I phone a friend that knows all about the BiolinkHelper to decide what to do with the CategoryManager? I'm thinking that BiolinkHelper might already do everything that CategoryManager does and we can retire CategoryManager?

In any case, the current state is that NodeSynonymizer is very slow because it calls CategoryManager and CategoryManager is thrashing about unable to function properly. And is probably too slow even if it did function correctly. We need a better solution and I bet someone out there knows how to do it! Help!

edeutsch avatar Jan 24 '23 05:01 edeutsch

yes, I believe BiolinkHelper should have all the functionality you need - other modules have switched to using it instead of the CategoryManager. we should probably retire it after verifying all usages have been replaced with BiolinkHelper.

you use it like this:

bh = BiolinkHelper()
bh.get_ancestors("biolink:Drug")
bh.get_descendants("biolink:ChemicalEntity")

it automatically detects and uses ARAX's current Biolink version (parsed from the OpenAPI yaml file), though you can also specify a different version if you want (BiolinkHelper(biolink_version="3.0.0")).

by default, get_ancestors() and get_descendants() include mixins and ARAX-defined conflations (Gene==Protein and Disease==PhenotypicFeature), though you can turn that behavior off with the include_mixins and include_conflations parameters.

you can also pass in lists of categories/predicates.

there's a readme here with a little more info: https://github.com/RTXteam/RTX/tree/master/code/ARAX/BiolinkHelper

amykglen avatar Jan 24 '23 17:01 amykglen

Hi everyone, an update on this. Upon first testing the NodeSynonymizer slowness problems identified here:

  1. NodeSynonymizer was taking 5 seconds to execute a fairly simple fetch. I determined that the CategoryManager() was trying to fetch category ancestors from an SRI endpoint that wasn't working for us, mostly because they don't support the Biolink version we're using and maybe other problems. This was solved by deprecating the CategoryManager and switching to BiolinkHelper completely. Note that the CategoryManager() was using the BiolinkHelper() already, mostly to know the current Biolink version.

  2. After that fix, the NodeSynonymizer went from 5 seconds on that test to 0.5 seconds! Yay! But, then we realized that the initial problem we were seeing was that the NodeSynonymizer was always taking 0.5 seconds no matter how trivial the task. So, I managed to fix a problem, but it wasn't the problem. We were still at 0.5 seconds and it should be much faster.

  3. After additional profiling, I narrowed the slowness down to a single line:

biolink_helper = BiolinkHelper()

whoah! It turns out that the NodeSynonymizer only needed the BiolinkHelper() rarely. But it seemed innocuous to just always set it up. So I have updated get_canonical_curies() to only create the BioinkHelper() object if it really needs it. Which is only needed sometimes. an now the NodeSynonymizer is back to its former self. Instead of taking 0.5 seconds for any and every query, we're now doing:

Elapsed time: 0.0010444819927215576

Scorching!

So the lessons are:

  1. It might be worth seeing if there is a way to speed up instantiating a BiolinkHelper() object. It seems to take 0.5 seconds. Even if all the caller wants to know "What's our current Biolink version?"
  2. Any code instantiating the BiolinkHelper() might choose to only instantiate it when absolutely necessary
  3. It would be a good programming practice in general to not instantiate complex objects unless really needed. If only I'd taken programming courses, I would surely have learned this, but instead I figured it out the hard way.

I pushed these fixes to master and deployed to test, beta, kg2beta, devED. We have now gone from:

2023-01-18T19:50:01.856687 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-18T19:50:02.381339 DEBUG: Got response back from NodeSynonymizer

to:

2023-01-28T07:29:43.747069 DEBUG: Sending NodeSynonymizer.get_canonical_curies() a list of 1 curies
2023-01-28T07:29:43.747468 DEBUG: Got response back from NodeSynonymizer

Overall the test query has gone from an average of 57 seconds to 40 seconds. Not a fantastic increase, but substantial enough for tonight.

edeutsch avatar Jan 28 '23 07:01 edeutsch

Adding @kvnthomas98 as he might eventually address the slow reasoning tests

dkoslicki avatar Apr 05 '23 18:04 dkoslicki