
Start naming KG2 TSV tarball with version number (in S3)?

Open amykglen opened this issue 3 years ago • 10 comments

Instead of extracting KG2pre data via Neo4j, going forward the KG2c build process is going to ingest kg2-tsv-for-neo4j.tar.gz (downloaded from the rtx-kg2 S3 bucket).

Wondering if it would be reasonable to start naming that tarball in S3 with the KG2 version number? So, something like: kg2-7-2-tsv-for-neo4j.tar.gz

I realize that means we'd have to periodically delete old versions of the file so the S3 bucket doesn't get overly full, but it'd be really nice for the KG2c build process to be able to make sure it gets the right tarball (since currently the tarball is overwritten every time a new KG2pre build is done).

amykglen avatar Aug 27 '21 06:08 amykglen

OK, I am thinking about how to do this while still preserving automation in the tsv-to-neo4j.sh script.

saramsey avatar Aug 31 '21 17:08 saramsey

I have created branch issue-140 for working on this issue.

saramsey avatar Aug 31 '21 17:08 saramsey

I have a mini build-system for the issue-140 branch working on my MBP, for development/test purposes for this issue.

saramsey avatar Aug 31 '21 17:08 saramsey

Lili and I discussed it and we feel this issue may slip until after 2.7.4

saramsey avatar Oct 19 '21 21:10 saramsey

Wondering if we can prioritize this for the next few weeks? @acevedol and @ecwood do you think it is doable?

saramsey avatar Aug 21 '23 23:08 saramsey

I'm specifically thinking that the output filenames that go to the S3 bucket should have the version number in the filename. I don't think the filenames on buildkg2.rtx.ai or whatever need to have the version number in the filename. Does that simplify things somewhat?

In hindsight, I don't think my decision to copy files like kg2-simplified.json to the S3 bucket without a version number in the filename was a very good choice. Too much chance for confusion. It puts us in the position of having to check MD5 hashes or inspect the RTX:KG2 node in order to be sure which version a file is. We end up doing a surprising amount of that, and it seems like it could mostly be avoided if the S3 file artifacts had the version number embedded in the filename, or were stored in a version-named folder on S3 (to avoid clutter in the bucket).
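
To make the two options concrete, here is a rough sketch (illustrative only, not code from the RTX-KG2 repository) of how an upload step might derive the S3 object key while leaving the local filename unchanged. The function name `versioned_s3_key`, the `use_folder` switch, and the version string `KG2.8.4` are all hypothetical.

```python
def versioned_s3_key(filename: str, version: str, use_folder: bool = False) -> str:
    """Return an S3 object key that carries the KG2 version.

    Hypothetical helper: only the key used on upload changes; the file
    on the build host keeps its unversioned name.
    """
    if use_folder:
        # Option 2: a version-named "folder" (key prefix) in the bucket.
        return f"{version}/{filename}"
    # Option 1: embed the version before the (possibly compound) extension,
    # so "x.tar.gz" becomes "x-<version>.tar.gz".
    stem, dot, ext = filename.partition(".")
    return f"{stem}-{version}{dot}{ext}"


# "KG2.8.4" is a made-up version string used purely for illustration.
print(versioned_s3_key("kg2-simplified.json", "KG2.8.4"))
print(versioned_s3_key("kg2-simplified.json", "KG2.8.4", use_folder=True))
```

Either form would let a consumer tell which build a file came from without checking MD5 hashes; the prefix form keeps the top level of the bucket less cluttered.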

saramsey avatar Aug 21 '23 23:08 saramsey

I can try to work on this in the next few weeks. I like the idea of a version-named folder on S3 to avoid clutter.

ecwood avatar Aug 21 '23 23:08 ecwood

is there any way this could be implemented soon? really all we would like is that the kg2-tsv-for-neo4j.tar.gz in S3 is somehow named with its version number - either in the filename itself or by putting it in a subdirectory for that version. no need to change the file name within the KG2pre build itself (just upon upload to S3). it would be a big help for improving the robustness of KG2c builds.

amykglen avatar Jul 15 '24 18:07 amykglen

This should be done now. It will look something like kg2-tsv-for-neo4j-KG2.X.Y.tar.gz in the next build.
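
For anyone consuming the tarball (e.g. the KG2c build), a minimal sketch of building and checking a name of that shape is below. It assumes X and Y are plain integers and that the name follows the pattern exactly; the function names are hypothetical and this is not the actual build code.

```python
import re
from typing import Optional, Tuple

# Assumed pattern, based only on the example name given above.
_TARBALL_RE = re.compile(r"^kg2-tsv-for-neo4j-KG2\.(\d+)\.(\d+)\.tar\.gz$")


def tarball_name(x: int, y: int) -> str:
    """Build the versioned tarball name for KG2.X.Y."""
    return f"kg2-tsv-for-neo4j-KG2.{x}.{y}.tar.gz"


def parse_tarball_version(name: str) -> Optional[Tuple[int, int]]:
    """Return (X, Y) from a versioned tarball name, or None if it does not match."""
    match = _TARBALL_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# A consumer could refuse to proceed when the version is not the expected one.
expected = (10, 1)
name = tarball_name(*expected)
if parse_tarball_version(name) != expected:
    raise ValueError(f"unexpected tarball: {name}")
```

The old unversioned name, kg2-tsv-for-neo4j.tar.gz, would fail to parse under this pattern, which is the kind of mismatch the KG2c build wants to catch early.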

ecwood avatar Jul 17 '24 21:07 ecwood

awesome, thank you!!

amykglen avatar Jul 17 '24 22:07 amykglen

This was successful in the KG2.10.1pre build, other than the report compare (as described here: https://github.com/RTXteam/RTX-KG2/issues/408#issuecomment-2336826509), so I am closing out this issue.

ecwood avatar Sep 08 '24 21:09 ecwood

Thank you @ecwood for doing this!

saramsey avatar Sep 09 '24 16:09 saramsey