remove the need for `ncbi_assembly_metadata`

Open SchwarzMarek opened this issue 1 year ago • 1 comments

Description of feature

As discussed in #170, I suggest dropping the `--ncbi_assembly_metadata` requirement and obtaining the relevant assemblies directly from their assembly IDs.

Below is a Python 3 script that downloads assemblies by accession (using NCBI's API).

At the moment the script downloads FASTA, GFF and GBFF files (for my convenience); this can be adjusted to whatever bacass needs.

Possible interfaces:

Python import of the `download` function
CLI (two modes for my convenience; can easily be simplified)
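If the CLI is kept, the two modes could be expressed with `argparse` subcommands instead of reading `sys.argv` by position. This is only a sketch: the argument names (`target`, `accessions`, `summary`) and the `parse_args` helper are illustrative and not part of the attached script.

```python
import argparse


def parse_args(argv=None):
    """Parse the two modes: explicit accessions or a KmerFinder summary file."""
    parser = argparse.ArgumentParser(prog="dbkref.py")
    modes = parser.add_subparsers(dest="mode", required=True)

    # mode 1: accessions given directly on the command line
    accs = modes.add_parser("accs", help="download the listed accessions")
    accs.add_argument("target", help="directory to extract into")
    accs.add_argument("accessions", nargs="+", help="assembly accessions")

    # mode 2: accessions read from a KmerFinder summary file
    summ = modes.add_parser("kmerfinder_summary",
                            help="read accessions from a KmerFinder summary")
    summ.add_argument("target", help="directory to extract into")
    summ.add_argument("summary", help="path to the KmerFinder summary file")

    return parser.parse_args(argv)
```

Calling `parse_args(["accs", "out", "ACC_1", "ACC_2"])` (placeholder accessions) yields a namespace with `mode`, `target` and `accessions` set, and a missing argument produces a usage message automatically.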

dependencies:

urllib3 (could probably be rewritten to use the standard-library urllib)
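To show what dropping the urllib3 dependency might look like, here is a rough standard-library variant of the same download. The URL and query string are the ones used in the script; `build_url`, `download_stdlib` and the `opener` parameter are hypothetical names introduced here, the latter only so the function can be exercised without network access.

```python
import zipfile
from io import BytesIO
from urllib.request import urlopen

BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/"
QUERY = ("include_annotation_type=GENOME_FASTA"
         "&include_annotation_type=GENOME_GBFF"
         "&include_annotation_type=GENOME_GFF"
         "&hydrated=FULLY_HYDRATED")


def build_url(accs):
    """Build the request URL; sorting keeps it deterministic for sets."""
    return f"{BASE}{','.join(sorted(accs))}/download?{QUERY}"


def download_stdlib(accs, target, opener=urlopen):
    """Fetch the zip archive for `accs` and unpack it into `target`."""
    # urlopen raises urllib.error.HTTPError for HTTP error statuses such as 404
    with opener(build_url(accs)) as resp:
        payload = resp.read()
    # the response body is a zip archive; unpack it straight from memory
    with zipfile.ZipFile(BytesIO(payload)) as z:
        z.extractall(target)
```

This has not been tested against the live API as part of this issue, so treat it as a starting point rather than a drop-in replacement.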

result:

the downloaded data are available under [target dir]/ncbi_dataset/data/ (this can be changed, at the cost of added complexity)
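A downstream step could collect the extracted files with something like the sketch below. It assumes that the archive places each assembly in its own subdirectory of `ncbi_dataset/data/`; that layout is my assumption and should be checked against a real download. `files_by_accession` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path


def files_by_accession(target):
    """Map each subdirectory of <target>/ncbi_dataset/data to its file names."""
    data_dir = Path(target) / "ncbi_dataset" / "data"
    found = defaultdict(list)
    # only look one level down: <data_dir>/<subdirectory>/<file>
    for path in sorted(data_dir.glob("*/*")):
        if path.is_file():
            found[path.parent.name].append(path.name)
    return dict(found)
```

Files that sit directly in `ncbi_dataset/data/` are ignored by this helper, since only one level of subdirectories is scanned.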

known limitations:

works well with a low to medium number of assemblies; personally, I would keep this under 50 per request. Larger but still reasonable numbers could be handled by the script by iterating over chunks. If we aim for much larger numbers (thousands), then I would advise using ncbi-datasets-cli (available e.g. from conda).
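The chunked iteration mentioned above could look roughly like this. The batch size of 50 comes from the suggestion above; `chunks`, `download_in_chunks` and the `download_fn` parameter are hypothetical names, with `download_fn` standing in for the script's `download(accs, target)`.

```python
from itertools import islice


def chunks(accs, size=50):
    """Yield de-duplicated accessions in sorted batches of at most `size`."""
    it = iter(sorted(set(accs)))
    while batch := list(islice(it, size)):
        yield batch


def download_in_chunks(accs, target, download_fn, size=50):
    """Issue one request per batch instead of one request for everything."""
    for batch in chunks(accs, size):
        download_fn(batch, target)
```

Every batch is extracted into the same target directory, so any files the archives have in common would be overwritten by the later batches.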

import urllib3
import sys
import zipfile
from io import BytesIO


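def download(accs, target):
    """Download the genome packages for `accs` and unpack them into `target`."""
    # sorted() keeps the request URL deterministic when accs is a set
    url = (f"https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/"
           f"{','.join(sorted(accs))}/download?"
           f"include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GBFF&include_annotation_type=GENOME_GFF"
           f"&hydrated=FULLY_HYDRATED")

    with urllib3.PoolManager() as http:
        resp = http.request("GET", url)
        # fail early with a readable message instead of a BadZipFile error
        if resp.status != 200:
            raise RuntimeError(f"NCBI request failed with HTTP status {resp.status}")

        # the response body is a zip archive; unpack it straight from memory
        with zipfile.ZipFile(BytesIO(resp.data)) as z:
            z.extractall(target)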


if __name__ == "__main__":
    if len(sys.argv) < 4:
        print('USAGE: python3 dbkref.py accs [TARGET DIR] SPACE DELIMITED ASSEMBLY ACCESSIONS')
        print('USAGE: python3 dbkref.py kmerfinder_summary [TARGET DIR] PATH_TO_KMERFINDER_SUMMARY_FILE')
        raise ValueError('Invalid input, please see usage.')

    target = sys.argv[2]
    if sys.argv[1] == 'accs':
        accs = set(sys.argv[3:])
        download(accs, target)
    elif sys.argv[1] == 'kmerfinder_summary':
        kmersumm = sys.argv[3]
        with open(kmersumm) as f:
            _ = f.readline()  # skip the header line
            # the accession is taken from the second comma-separated column
            accs = {line.split(',')[1] for line in f if line.strip() != ''}
        download(accs, target)
    else:
        raise ValueError('invalid mode choice, valid are "kmerfinder_summary" or "accs"')

If you are interested in this, could you please check whether urllib3 is available in the bacass Python 3 environment?

dbkref.py.zip

Oct 02 '24 08:10 SchwarzMarek

Thank you so much, @SchwarzMarek ! I've been out of office these days. 🙏🏾 I plan to test it locally by next week and share my thoughts. 😉

Oct 04 '24 16:10 Daniel-VM
