[Feature Request] Support for querying the SRA metadata using AWS Athena and Google BigQuery

Open abhi18av opened this issue 3 years ago • 14 comments

New feature

The recent collaboration between NCBI and the cloud providers makes it possible to query the entire SRA by its metadata using AWS Athena.

Here are some relevant resources:

https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

https://registry.opendata.aws/ncbi-sra/

https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena-examples/

https://www.youtube.com/playlist?list=PLH-TjWpFfWrt5MNqU7Jvsk73QefO3ADwD

NOTE: The same could be done on GCP (with BigQuery) as well; for now I have not created a separate issue for that.

Suggested implementation

I'm sure there is a more elegant approach, but as an initial draft we could implement this in a couple of ways:

  1. As a separate method fromNCBI, which allows one to pass a closure-based query for any particular database from NCBI.
// Build the metadata query for a given database and organism
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}

Channel.fromNCBI ( query: ncbi_query("SRA", "Homo sapiens") )
  2. Or as a more specialized enhancement of the fromSRA method, which allows a closure-based query to be passed to the query field. For example,
// Build the metadata query for a given database and organism
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}

Channel.fromSRA ( query: ncbi_query("SRA", "Homo sapiens") )
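
As a rough sketch of how the query-building closure in both drafts could be hardened, the plain Groovy below quotes and escapes the organism value and validates the database name before interpolating it. The names `escapeSqlLiteral` and `ncbiQuery` and the optional `limit` parameter are hypothetical, and the `<db>.metadata` table layout is simply assumed from the drafts above; this is not an existing Nextflow API.

```groovy
// Hypothetical helper: doubles single quotes so a value cannot
// terminate the SQL string literal early.
def escapeSqlLiteral = { String value ->
    value.replace("'", "''")
}

// Hypothetical query builder. Assumes the table is addressed as
// <db>.metadata, as in the drafts above.
def ncbiQuery = { String db, String organism, int limit = 10 ->
    // The database name is interpolated outside of quotes, so restrict
    // it to plain identifier characters.
    assert db ==~ /[A-Za-z_][A-Za-z0-9_]*/ : "unexpected database name: $db"
    [
        "SELECT *",
        "FROM ${db}.metadata",
        "WHERE organism = '${escapeSqlLiteral(organism)}'",
        "LIMIT ${limit}"
    ].join('\n')
}

println ncbiQuery('SRA', 'Homo sapiens')
```

The resulting string could then be handed to whichever channel factory ends up accepting a `query` option.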

Related https://github.com/nextflow-io/nextflow/issues/1605

abhi18av avatar Jan 04 '21 11:01 abhi18av

This could overlap with https://github.com/nextflow-io/nextflow/pull/1611

abhi18av avatar Jan 04 '21 11:01 abhi18av

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 22 '21 07:10 stale[bot]

This is a good candidate for a Nextflow plugin, along the same lines as nf-sqldb.

pditommaso avatar Oct 22 '21 08:10 pditommaso

I'd be happy to give this a shot 👍

abhi18av avatar Oct 22 '21 10:10 abhi18av

Update:

The work is currently being done on my fork, https://github.com/abhi18av/nextflow/tree/abhinav/nf-sraql, with BigQuery as the default source.

Once it is presentable, I'll create and link the PR to this repo.

abhi18av avatar Nov 04 '21 14:11 abhi18av

Cool! Willing to make a PR so changes will be more clear?

pditommaso avatar Nov 10 '21 08:11 pditommaso

Absolutely, will make a PR ~by EOD today~ 👍

Initiated the draft PR with the scratch work, happy to receive any feedback.

abhi18av avatar Nov 10 '21 09:11 abhi18av

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 01:04 stale[bot]

WIP - not stale.

abhi18av avatar Apr 16 '22 13:04 abhi18av

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 20 '22 20:09 stale[bot]

Whoa, this slipped under my radar after the health crisis. @pditommaso, could you confirm whether this is still relevant? If so, I'd be happy to pick this back up and make a push.

abhi18av avatar Sep 21 '22 07:09 abhi18av

Not a priority, but surely a nice-to-have. Shouldn't this already work via a JDBC connection? What's missing?

pditommaso avatar Sep 21 '22 10:09 pditommaso

I think it is already working for BigQuery, but I still need to handle paging for large result sets.
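
To illustrate what handling paging could look like in general terms, here is a minimal Groovy sketch of an offset-based loop that keeps requesting pages until one comes back short. It is not the code from the draft PR: `fetchAll`, `fetchPage`, and the in-memory `fakeTable` are hypothetical stand-ins, and a real client might use page tokens rather than offsets.

```groovy
// Hypothetical paging loop. `fetchPage` stands in for whatever call
// returns one page of rows for a given offset and page size.
def fetchAll = { int pageSize, Closure<List> fetchPage ->
    def rows = []
    int offset = 0
    while (true) {
        List page = fetchPage(offset, pageSize)
        rows.addAll(page)
        // A short (or empty) page means there is nothing left to read
        if (page.size() < pageSize) break
        offset += pageSize
    }
    rows
}

// Stand-in data source with 25 rows, used only to exercise the loop
def fakeTable = (1..25).collect { 'SRR' + it }
def fakePage = { int offset, int limit ->
    fakeTable.drop(offset).take(limit)
}

println fetchAll(10, fakePage).size()
```

In a plugin, each page would more likely be emitted to the channel as it arrives instead of being accumulated in memory.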

abhi18av avatar Sep 21 '22 12:09 abhi18av

The most useful thing would be an example in the README. Without that, nobody will even know it exists.

pditommaso avatar Sep 21 '22 12:09 pditommaso

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 09:03 stale[bot]