Potentially moving KG2c sqlite dependency out of ARAX?
After exploring the ARAX Expand code, we discovered that there is a run-time dependency on the KG2c.sqlite file, specifically to decorate edges returned by PloverDB.
After discussing this with @saramsey and @bazarkua, one idea is to host a MySQL version of KG2c on an EC2 instance and use it as a decorator API for the RTX-KG2c results returned by PloverDB.
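To make the "decorator" idea concrete, here is a minimal sketch of what the lookup could look like, whether it runs against the local sqlite file or behind a hosted API. The `edges` table, its `id` and `attributes` columns, and the `decorate_edges` helper are all hypothetical names for illustration; they are not the actual KG2c.sqlite schema or ARAX code.

```python
import json
import sqlite3

CHUNK_SIZE = 500  # stay well under SQLite's bound-parameter limit


def decorate_edges(conn, edges):
    """Attach stored attributes to stripped-down edges, keyed by edge ID.

    `edges` maps edge ID -> edge dict. Edges with no row in the
    (hypothetical) `edges` table are left untouched.
    """
    edge_ids = list(edges)
    for start in range(0, len(edge_ids), CHUNK_SIZE):
        chunk = edge_ids[start:start + CHUNK_SIZE]
        placeholders = ",".join("?" for _ in chunk)
        rows = conn.execute(
            f"SELECT id, attributes FROM edges WHERE id IN ({placeholders})",
            chunk,
        )
        for edge_id, attributes_json in rows:
            edges[edge_id]["attributes"] = json.loads(attributes_json)
    return edges


# Toy in-memory database standing in for the real file (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id TEXT PRIMARY KEY, attributes TEXT)")
conn.execute(
    "INSERT INTO edges VALUES (?, ?)",
    ("e1", json.dumps({"publications": ["PMID:1"]})),
)

# Stripped-down edges, as a fast graph query might return them.
stripped = {
    "e1": {"subject": "CHEBI:1", "predicate": "biolink:treats", "object": "MONDO:1"},
    "e2": {"subject": "CHEBI:2", "predicate": "biolink:treats", "object": "MONDO:2"},
}
decorated = decorate_edges(conn, stripped)
```

The same function body would work against MySQL with a different connection object and `%s` placeholders, which is part of why the hosting question is separable from the lookup logic.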
A few stray ideas.
- The new Translator architecture system has an "Annotator" component, which I think is highly analogous to this. But it might only be for nodes, not edges.
- Is our decorator for both nodes and edges?
- Instead of standing up another instance, I wonder if it would be sensible for this just to be another functionality of PloverDB. Since it can already provide fully annotated results, it already has all the information, and it can also be asked for stripped-down results for speed. Maybe it would be easy to add a mechanism/endpoint for decorating? This would remove the need for yet another instance.
Upon reflection, I have some hesitation about creating (and provisioning, deploying, maintaining, and troubleshooting) a whole dedicated hosted MySQL service just for this, especially in light of the fact that a major Translator re-architecting is coming in the next year. I wonder if we should just stick with accessing the KG2c sqlite file in an EFS volume for the time being. I've lost track of what pain point we are trying to address by removing this dependency from ARAX. I'm sure we discussed this before, but I just need some help remembering.
There are two pain points that I am aware of:
- This file is very large (>10GB), and copying this (and other) large files to new instances has been the cause of much pain, mostly with ITRB.
- As we port our ARAX functionality to Shepherd, I am reasonably confident that continuing to require that huge files be copied to all deployments is not an acceptable design. I don't think we have actually asserted that we will implement this requirement and been told "NO". But it seems like a very unpleasant thing to impose on our shared code project, and Max has agreed when I have stated that adding such a dependency to Shepherd did not seem good.
OK, fair enough. And it is even worse than that; the KG2c sqlite file is 39 GiB, I believe. And yes, you are right, it's a big pain point vis-a-vis operations. Thank you for reminding me.
On the other hand, the Explainable DTD's database is 43 GiB; the node synonymizer database is 18 GiB; and the COHD database is 38 GiB. So it would seem that we have 3–4 heavyweight sqlite database files to consider, depending on whether we consider COHD expendable or not. I wonder if we should broaden this issue to consider all of them?
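For a rough sense of the combined footprint, the sizes quoted above can simply be added up. This is just arithmetic on the approximate numbers in this thread (the KG2c figure is the "I believe" estimate), not a measurement:

```python
# Approximate sizes in GiB, as quoted in this thread.
sizes_gib = {
    "KG2c": 39,
    "Explainable DTD": 43,
    "node synonymizer": 18,
    "COHD": 38,
}

total_with_cohd = sum(sizes_gib.values())
total_without_cohd = total_with_cohd - sizes_gib["COHD"]

print(f"With COHD:    {total_with_cohd} GiB")     # 138 GiB
print(f"Without COHD: {total_without_cohd} GiB")  # 100 GiB
```

So the per-deployment copy is on the order of 100 to 138 GiB, of which KG2c is well under half.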
This idea is, I think, dependent on the outcome of the EFS feasibility study that @sundareswarpullela and @bazarkua are doing for #2524.