bacass
remove the need for `ncbi_assembly_metadata`
Description of feature
As discussed in #170, I suggest dropping the `--ncbi_assembly_metadata` requirement and obtaining the relevant assemblies directly from their assembly IDs.
Below is a Python 3 script that downloads assemblies by accession using NCBI's API.
At the moment the script downloads FASTA, GFF and GBFF files (for my convenience); this can be adjusted to whatever bacass needs.
Possible interfaces:
- Python import of the `download` function
- CLI (2 modes for my convenience, can be easily simplified)
Dependencies:
- `urllib3` (could probably be rewritten for `urllib`)
Result:
- the obtained data is accessible under `[target dir]/ncbi_dataset/data/` (can be adjusted at the cost of added complexity)
Known limitations:
- works well with a low to medium number of assemblies; personally, I would keep this under 50 per request. Reasonable numbers can be handled by the script (chunk iter). But if we were to aim for even larger numbers (thousands), then I would advise using `ncbi-datasets-cli` (available e.g. from conda).
```python
import sys
import zipfile
from io import BytesIO

import urllib3


def download(accs, target):
    """Download the assemblies listed in accs and unpack them into target."""
    url = (f"https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/"
           f"{','.join(accs)}/download?"
           f"include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GBFF&include_annotation_type=GENOME_GFF"
           f"&hydrated=FULLY_HYDRATED")

    with urllib3.PoolManager() as http:
        with http.request("GET", url, preload_content=False) as resp:
            resp.auto_close = False
            # fail early instead of handing an error body to zipfile
            if resp.status != 200:
                raise RuntimeError(f"NCBI request failed with HTTP status {resp.status}")
            # the whole zip archive is read into memory before extraction
            with zipfile.ZipFile(BytesIO(resp.data)) as z:
                z.extractall(target)


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print('USAGE: python3 dbkref.py accs [TARGET DIR] SPACE DELIMITED ASSEMBLY ACCESSIONS')
        print('USAGE: python3 dbkref.py kmerfinder_summary [TARGET DIR] PATH_TO_KMERFINDER_SUMMARY_FILE')
        raise ValueError('Invalid input, please see usage.')

    target = sys.argv[2]
    if sys.argv[1] == 'accs':
        # mode 1: accessions are given directly on the command line
        accs = set(sys.argv[3:])
        download(accs, target)
    elif sys.argv[1] == 'kmerfinder_summary':
        # mode 2: accessions are read from the second column of the summary file
        kmersumm = sys.argv[3]
        with open(kmersumm) as f:
            _ = f.readline()  # ditch first line
            accs = {line.split(',')[1] for line in f if line.strip() != ''}
        download(accs, target)
    else:
        raise ValueError('invalid mode choice, valid are "kmerfinder_summary" or "accs"')
```
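The limitation mentioned above (keeping each request under roughly 50 assemblies) could be handled by splitting the accessions into batches before calling `download`. The following is only a minimal sketch: `chunked` and `download_in_chunks` are hypothetical helpers that are not part of the script above, and the default batch size of 50 is just the rule of thumb from the limitations list, not an NCBI-documented limit.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def download_in_chunks(accs, target, download_fn, size=50):
    """Call `download_fn(batch, target)` once per batch of accessions.

    `download_fn` is assumed to behave like the `download` function above.
    Accessions are de-duplicated and sorted so the batches are reproducible.
    """
    for batch in chunked(sorted(set(accs)), size):
        download_fn(batch, target)
```

It could be called as `download_in_chunks(accs, target, download)`. Since every batch is extracted into the same target directory, any files with identical paths in successive archives would be overwritten by later batches.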
If you are interested in this, could you please check whether `urllib3` is available in the Python 3 environment used by bacass?
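If `urllib3` turns out not to be available, the download could probably be done with the standard library alone. This is a rough sketch under that assumption: `build_url` and `download_stdlib` are hypothetical names, the base URL and query parameters are copied from the script above, and the 300-second timeout is an arbitrary choice.

```python
import zipfile
from io import BytesIO
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URL copied from the script above.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession"


def build_url(accs):
    """Build the download URL for the given accessions."""
    query = urlencode(
        [
            ("include_annotation_type", "GENOME_FASTA"),
            ("include_annotation_type", "GENOME_GBFF"),
            ("include_annotation_type", "GENOME_GFF"),
            ("hydrated", "FULLY_HYDRATED"),
        ]
    )
    # accessions are sorted so the same input always gives the same URL
    joined = ",".join(quote(acc, safe="") for acc in sorted(accs))
    return f"{BASE_URL}/{joined}/download?{query}"


def download_stdlib(accs, target, timeout=300):
    """Fetch the zip archive with urllib.request and extract it into target."""
    # urlopen raises urllib.error.HTTPError for HTTP error responses
    with urlopen(build_url(accs), timeout=timeout) as resp:
        payload = resp.read()  # the whole archive is held in memory
    with zipfile.ZipFile(BytesIO(payload)) as archive:
        archive.extractall(target)
```

As in the original script, the archive is read fully into memory before extraction, so this variant is subject to the same practical limit on the number of assemblies per request.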
Thank you so much, @SchwarzMarek ! I've been out of office these days. 🙏🏾 I plan to test it locally by next week and share my thoughts. 😉