nextflow
[Feature Request] Support for querying the SRA metadata using AWS Athena and Google BigQuery
New feature
The recent collaboration between NCBI and the cloud providers allows one to query the entire archive by its metadata in AWS Athena.
Here are some relevant resources:
https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/
https://registry.opendata.aws/ncbi-sra/
https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena-examples/
https://www.youtube.com/playlist?list=PLH-TjWpFfWrt5MNqU7Jvsk73QefO3ADwD
NOTE: The same could be done for GCP (BigQuery) as well; for now I have not created a separate issue for that.
Suggested implementation
I'm sure there must be a more elegant approach, but as an initial draft we could implement this in a couple of ways:
- As a separate method, fromNCBI, which allows one to pass a closure-based query for any particular NCBI database.
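// Builds the SQL text; ${db} needs braces so Groovy does not read it as db.metadata,
// and the organism value is quoted as a SQL string literal
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}
Channel.fromNCBI( query: ncbi_query("SRA", "Homo sapiens") )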
- Or as a more specialized enhancement of the fromSRA method, which allows a closure to be passed to the query field. For example:
// Same query builder as above: braces around ${db}, quotes around the organism value
def ncbi_query = { db, orgnsm ->
"""
SELECT *
FROM ${db}.metadata
WHERE organism = '${orgnsm}'
LIMIT 10
"""
}
Channel.fromSRA( query: ncbi_query("SRA", "Homo sapiens") )
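For illustration only, here is a minimal plain-Groovy sketch of how the query closure above could guard its inputs before interpolating them. The names sqlLiteral and buildQuery are hypothetical helpers, not part of Nextflow, and the lowercase database name sra is an assumption based on the Athena examples linked above.

```groovy
// Hypothetical helper: render a value as a single-quoted SQL string literal,
// doubling any embedded single quotes (standard SQL escaping).
def sqlLiteral = { String value ->
    "'" + value.replace("'", "''") + "'"
}

// Hypothetical query builder mirroring the closure proposed above.
// The database name is interpolated as an identifier, so it is checked
// against a conservative pattern instead of being quoted.
def buildQuery = { String db, String organism, int limit = 10 ->
    if (!(db ==~ /[A-Za-z_][A-Za-z0-9_]*/)) {
        throw new IllegalArgumentException("Unexpected database name: ${db}")
    }
    [
        "SELECT *",
        "FROM ${db}.metadata",
        "WHERE organism = ${sqlLiteral(organism)}",
        "LIMIT ${limit}"
    ].join('\n')
}

println buildQuery('sra', 'Homo sapiens')
```

The resulting string could then be handed to whichever channel factory ends up accepting a query, whether fromNCBI or fromSRA.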
Related https://github.com/nextflow-io/nextflow/issues/1605
This could overlap with https://github.com/nextflow-io/nextflow/pull/1611
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is a good candidate for a Nextflow plugin, along the same lines as nf-sqldb
I'd be happy to give this a shot 👍
Update:
The work is being done on my fork as of now https://github.com/abhi18av/nextflow/tree/abhinav/nf-sraql , with BigQuery as the default source.
Once it is presentable, I'll create and link the PR to this repo.
Cool! Willing to make a PR so changes will be more clear?
Absolutely, will make a PR ~by EOD today~ 👍
Initiated the draft PR with the scratch work, happy to receive any feedback.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
WIP - not stale.
Whoa, this slipped under my radar after the health crisis. @pditommaso, could you confirm whether this is still relevant? If so, I'd be happy to pick it back up and make a push.
Not a priority, but surely a nice-to-have. Shouldn't this work via a DB JDBC connection? What's missing?
I think it is already working for BigQuery, but I needed to accommodate paging issues for large result sets.
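As a rough sketch of the paging concern, assuming the backend hands results back one page at a time together with a continuation token, the loop could look like the following. fetchPage is a hypothetical in-memory stand-in for the real BigQuery or Athena call, and the row values are placeholders.

```groovy
// Placeholder data and page size for the stand-in backend below
def fakeRows = (1..7).collect { "row-${it}".toString() }
def pageSize = 3

// Hypothetical stand-in for a backend call: returns the rows of one page
// plus the token of the next page, or a null token once results are exhausted.
def fetchPage = { String token ->
    int start = (token == null) ? 0 : token.toInteger()
    int end = Math.min(start + pageSize, fakeRows.size())
    [rows: fakeRows[start..<end], nextToken: (end < fakeRows.size() ? end.toString() : null)]
}

// Drain every page, handing each row to a callback as it arrives,
// so the full result set never has to be held in memory at once.
def eachRow = { Closure fetch, Closure onRow ->
    String token = null
    while (true) {
        def page = fetch(token)
        page.rows.each(onRow)
        token = page.nextToken
        if (token == null) break
    }
}

def seen = []
eachRow(fetchPage, { seen << it })
println "Fetched ${seen.size()} rows"
```

In a plugin, the callback would presumably emit each row into the channel rather than collect it into a list.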
The most useful thing would be an example in the README. Without that, nobody will even know it exists.