
Implement initial GA4GH DRS Support.

Open jmchilton opened this issue 2 years ago • 11 comments

Builds on #13947.

  • Implement APIs for serving Galaxy's public datasets as DRS objects.
  • Implement support in galaxy.files.uris for DRS URIs so that upload and similar operations can use DRS URIs from other data repositories.

Together, these effectively provide both client and server (consumer and producer) support for the DRS API.

The ways this implementation can be improved/extended are endless:

  • New object ID patterns can be implemented to allow access to composite data and metadata files through a bundle view of datasets.
  • New object ID patterns can be implemented to allow access to collections as bundles of datasets.
  • More auth types and AccessMethods could be implemented for loading datasets from clients.
  • Some sort of scheme for providing authenticated URLs to non-public data could be implemented to provide access to more of Galaxy's data if users allow or enable it.

Despite the limitations, I think having the guts in place like this provides useful starting points for interested parties to add the functionality their use cases need.

How to test the changes?

(Select all options that apply)

License

  • [x] I agree to license these contributions under Galaxy's current license.
  • [x] I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

jmchilton avatar May 22 '22 20:05 jmchilton

Thanks a lot @jmchilton! We discussed last week whether we can expose our reference data via DRS as well. Do you think that would be possible as a follow-up?

bgruening avatar May 23 '22 09:05 bgruening

@bgruening I think it should be fairly straightforward, at least for the file data. I think the only concern is how you create an "object ID" for the data. But that might be as simple as ref-<dbkey>-<table-name>. I guess there is value in the non-file reference data? If so, maybe we can include a virtual file in the result that grabs the metadata as JSON or something.
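
A minimal sketch of the ref-<dbkey>-<table-name> idea, purely for illustration. The function name is hypothetical and not part of Galaxy, and the example only builds IDs: parsing them back would be ambiguous if a dbkey itself contained a hyphen, so a real scheme would likely need escaping or a different separator.

```python
def build_reference_object_id(dbkey: str, table_name: str) -> str:
    """Build an object ID following the proposed ref-<dbkey>-<table-name> pattern.

    Hypothetical helper: the name and the validation rules are assumptions.
    """
    if not dbkey or not table_name:
        raise ValueError("both dbkey and table_name are required")
    return f"ref-{dbkey}-{table_name}"


# Example values; "hg38" and "all_fasta" are used only as sample inputs.
example_id = build_reference_object_id("hg38", "all_fasta")
```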

Do we have a driving use case for this?

jmchilton avatar May 23 '22 15:05 jmchilton

I guess there is value in the non-file reference data? If so - maybe we can include a virtual file in the result that grabs the metadata as a json or something.

We could use the manifest file from ROs to describe the metadata and offer it as a standard archive.

Do we have a driving use case for this?

Transparency: each Galaxy server offers a lot of great reference data, but it's not possible to look at it or reuse it (unless it is on CVMFS). We have been discussing a Galaxy page that at least shows all reference data, e.g. "this Galaxy server offers you these 200 reference genomes". A logical next step would be to offer them for download or to make them queryable by API.

bgruening avatar May 23 '22 19:05 bgruening

I'm gonna bump this to 22.09, I guess we should at least decide on the "format" of the object id

mvdbeek avatar Jun 14 '22 15:06 mvdbeek

Note to self: for the reference data we might additionally offer http://samtools.github.io/hts-specs/refget.html as an API.

bgruening avatar Jun 21 '22 00:06 bgruening

The reason for it being both consumer and producer in one PR is that I wanted to implement the consumer functionality but had nothing to test against. I couldn't find an existing server, and the reference implementation was pretty rough, so I decided it would be hard to integrate.

Give me a week or two to polish the server aspect and if I can't get it done - I'll be willing to spin out the consumer and just not have tests.

jmchilton avatar Sep 06 '22 13:09 jmchilton

I've rebased this and extended it to address some of the reviewer comments.

Two big changes:

  • checksums: If no checksums are available, #14576 is now used to compute a checksum and a 202 is issued.
  • dataset IDs: I prefix the IDs with hda- or ldda- to make it clear how multiple different patterns could exist in the future. It also uses a DRS-specific ID encoding that can be configured independently of Galaxy's ID encoding key.
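
To illustrate how a prefix leaves room for more ID patterns later, here is a rough sketch of splitting such an object ID into its kind and its encoded part. The function name and the tuple of known prefixes are assumptions made for this example; they are not taken from the PR's code.

```python
from typing import Tuple

# Prefixes mentioned in the comment above; new patterns would be added here.
KNOWN_PREFIXES = ("hda", "ldda")


def split_drs_object_id(object_id: str) -> Tuple[str, str]:
    """Split '<prefix>-<encoded id>' into (prefix, encoded id).

    Hypothetical helper; raises ValueError for unknown or malformed IDs.
    """
    prefix, separator, encoded_id = object_id.partition("-")
    if not separator or prefix not in KNOWN_PREFIXES or not encoded_id:
        raise ValueError(f"unrecognized DRS object ID: {object_id!r}")
    return prefix, encoded_id
```

With this shape, supporting something like collections would mean adding a new prefix rather than changing how existing IDs are interpreted.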

jmchilton avatar Sep 09 '22 17:09 jmchilton

@jmchilton is this still WIP?

bgruening avatar Sep 15 '22 16:09 bgruening

@bgruening if you're willing to merge it consider it out of WIP 😆.

jmchilton avatar Sep 15 '22 17:09 jmchilton

I am very happy with this PR :)

This tutorial, https://github.com/ga4gh/ismb-2022-ga4gh-tutorial/blob/main/sessions/session2/2-1%20Basic%20DRS.ipynb, includes this DRS URI: https://locate.be-md.ncbi.nlm.nih.gov//ga4gh/drs/v1/objects/fb1cfb04d3ef99d07c21f9dbf87ccc68

I will try to find some proper internet to test it.

bgruening avatar Sep 16 '22 14:09 bgruening

It seems I am not able to download drs:// URLs from the above-linked server. The link is downloaded as text instead.

bgruening avatar Sep 19 '22 22:09 bgruening

@bgruening Google is failing me here, but do you have access to a DRS ID that isn't 8 GB? I've put in some print statements, though, and it does seem like the code is trying to get the data from the right place. Is it possible you used:

https://locate.be-md.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/fb1cfb04d3ef99d07c21f9dbf87ccc68

and not

drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68

I don't think the tutorial notebook made it... clear I guess that the latter is the "correct" URI for the target file?
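
For reference, the relationship between the two forms above can be sketched as follows. This assumes the hostname-based drs:// convention, where the object lives under /ga4gh/drs/v1/objects/ on the same host. The function name is made up for this example and is not Galaxy's implementation.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_object_url(drs_uri: str) -> str:
    """Translate a hostname-based drs:// URI into the HTTPS URL of the DRS object.

    Illustrative only; compact-identifier DRS URIs are not handled.
    """
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    # Percent-encode the ID so it stays a single path segment.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"
```

Fetching that HTTPS URL returns JSON metadata describing the object, not the file bytes; a client then has to follow one of the listed access methods to get the data.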

diff --git a/test/unit/files/test_uris.py b/test/unit/files/test_uris.py
index 36732a8032..b23d3cdcf7 100644
--- a/test/unit/files/test_uris.py
+++ b/test/unit/files/test_uris.py
@@ -5,6 +5,7 @@ from galaxy.exceptions import (
     ConfigDoesNotAllowException,
 )
 from galaxy.files.uris import (
+    stream_url_to_str,
     validate_non_local,
     validate_uri_access,
 )
@@ -53,3 +54,15 @@ def validates(uri: str, is_admin, allow_list):
     except (ConfigDoesNotAllowException, AdminRequiredException):
         return False
     return True
+
+
+def test_drs():
+    # https://github.com/ga4gh/ismb-2022-ga4gh-tutorial/blob/main/sessions/session2/2-1%20Basic%20DRS.ipynb
+    hostname = "locate.be-md.ncbi.nlm.nih.gov"
+    host_url = f'https://{hostname}'
+    drs_id = 'fb1cfb04d3ef99d07c21f9dbf87ccc68'
+    full_url = host_url + '/ga4gh/drs/v1/objects/' + drs_id
+
+    drs_url = f"drs://{hostname}/{drs_id}"
+    contents = stream_url_to_str(drs_url)
+    print(contents)

jmchilton avatar Nov 08 '22 09:11 jmchilton

I will test this again tonight, thanks for checking John!

bgruening avatar Nov 08 '22 17:11 bgruening