Need a script for converting kg2-simplified.json.gz format to KGX-TSV
Can use https://github.com/RTXteam/RTX/blob/master/code/kg2c/kg2c_tsv_to_kgx_tsv.py as a starting point (but it will need some modifications, as described in Slack).
The motivation for this feature is enabling upload of KG2pre to KGE; see issue https://github.com/RTXteam/RTX-KG2/issues/153
Once we have a KGX-TSV export of KG2.7.3, we will want to upload it to KGE. Contact Richard Bruskiewich (SRI) who can turn on the server so that we can upload it. The URL for the server to which we will upload the KGX-TSV export of KG2.7.3 is https://kge-data-staging.starinformatics.ca/ . You will have to get an account on that server (Richard can help with that).
The KGX-TSV export of KG2.7.3c is already done (and uploaded to KGE), using a custom script https://github.com/RTXteam/RTX/blob/master/code/kg2c/kg2c_tsv_to_kgx_tsv.py
Example files that "document" the KGX TSV file format are in buildkg2c.rtx.ai in the directory /home/ubuntu/steve/tiny-test-max-100-edges, as nodes.tsv and edges.tsv
Hi @acevedol, commit 81fbb5e fixes an issue with publications_info: I think we want JSONified output for that specific field within the edges.tsv file. I also added argparse argument handling for consistency with the other KG2 Python scripts.
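For anyone following along, here is a minimal sketch of what "JSONified output" for that one field could look like. It is not the code from the commit; the function name `edge_to_row`, the column list, and the sample edge are made up for illustration, and only the `publications_info` field name comes from this thread.

```python
import json

def edge_to_row(edge, columns):
    """Return the TSV cell values (as strings) for one edge dict.

    Hypothetical helper: only the publications_info column is JSON-encoded;
    every other column is passed through as a plain string.
    """
    row = []
    for col in columns:
        value = edge.get(col, "")
        if col == "publications_info":
            # Serialize the nested structure as a single-line JSON string
            # so it fits in one TSV cell and can be parsed back later.
            value = json.dumps(value if value else {}, sort_keys=True)
        row.append(str(value))
    return row

# Illustrative column names and data (assumptions, not the real KG2 schema).
columns = ["subject", "predicate", "object", "publications_info"]
edge = {
    "subject": "CHEBI:1",
    "predicate": "biolink:related_to",
    "object": "CHEBI:2",
    "publications_info": {"PMID:123": {"sentence": "example sentence"}},
}
row = edge_to_row(edge, columns)
line = "\t".join(row)
```

A consumer of edges.tsv can then recover the structure with `json.loads` on that cell.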
Looks like in KG2.7.4pre, there are 11 nodes for which the description field has a hard tab in it:
ubuntu@ip-172-31-63-157:~/issue-154/RTX-KG2$ grep WARNING kg2_json_to_kgx_tsv.log | wc -l
11
No biggie, the kg2_json_to_kgx_tsv.py script is now converting those hard tabs to quad-spaces.
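A rough sketch of the tab handling, assuming one WARNING is logged per affected field; the helper name `sanitize_tabs` and the logger name are hypothetical and the real script may structure this differently.

```python
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("kg2_json_to_kgx_tsv_sketch")

def sanitize_tabs(node_id, field_name, text):
    """Replace each hard tab with four spaces, logging a WARNING if any were found."""
    if "\t" in text:
        logger.warning("hard tab in field %s of node %s; replacing with four spaces",
                       field_name, node_id)
        return text.replace("\t", "    ")
    return text

cleaned = sanitize_tabs("EXAMPLE:1", "description", "before\tafter")
```

Without this step, a tab inside a description would be read as a column separator and shift every later column in that row.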
Ah, need to convert newlines in description field. See this kind of issue in the nodes.tsv file:
nship between two chemical entities, where the subject represents the upstream entity and the object
represents the downstream. For any such association there is an implicit reaction:
IF
R has-input C1 AND
R has-output C2 AND
R enabled-by P AND
R type Reaction
THEN
The existing script worked, but it is a massive memory hog and will not work next time with a bigger KG2pre. Per discussion with Steve, I am experimenting with converting the JSON input file into JSON Lines so that the script no longer has to read the entire JSON file into memory.
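To illustrate why JSON Lines helps, here is a small sketch of the streaming side only: it reads one record per line and writes one TSV row at a time, so memory use stays roughly constant regardless of file size. The function names, file names, and columns are hypothetical, and the step that produces the JSON Lines file from the original JSON is not shown.

```python
import gzip
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file (gzipped if path ends in .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

def write_tsv(records, columns, out_path):
    """Stream records to a TSV file with a header row; return the number of data rows written."""
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        out.write("\t".join(columns) + "\n")
        for rec in records:
            out.write("\t".join(str(rec.get(c, "")) for c in columns) + "\n")
            count += 1
    return count

# Tiny demo with made-up nodes written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, "nodes.jsonl.gz")
with gzip.open(jsonl_path, "wt", encoding="utf-8") as handle:
    for node in ({"id": "X:1", "name": "first"}, {"id": "X:2", "name": "second"}):
        handle.write(json.dumps(node) + "\n")

tsv_path = os.path.join(tmp_dir, "nodes.tsv")
n_written = write_tsv(iter_jsonl(jsonl_path), ["id", "name"], tsv_path)
```

Because `iter_jsonl` is a generator, only one record is held in memory at a time; the tab and newline cleanup discussed earlier in this thread would slot in per record before each row is written.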