ENSEMBL-SEQUENCE does not work for all species

Open lczech opened this issue 7 months ago • 4 comments

Snakemake version
Snakemake: 8.15.2
Wrapper: "v3.13.6/bio/reference/ensembl-sequence"

Describe the bug
The path for downloading has a hard-coded structure in the wrapper:

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)
url_prefix = f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

This uses a hard-coded check for release > 75. However, the path structure differs for some species. For instance, A. thaliana is currently in Ensembl Plants release 59. Since 59 is not greater than 75, the wrapper appends the release number to the spec part of the filename, but the actual file on the server does not contain it.

The correct file name is

Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

but instead the wrapper is only checking for

Arabidopsis_thaliana.TAIR10.59.[dna.primary_assembly.fa.gz|dna.toplevel.fa.gz]

which has the additional 59 that should not be there. Hence, the download fails. I think a simple fix is to avoid the hard-coded 75, and instead check both variants of the path.
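
To make the suggestion concrete, here is a minimal sketch of what "check both variants" could look like. It is not the wrapper's actual code: the helper names `candidate_prefixes` and `first_existing`, the `exists` callback, and the `https://example.org/pub` base URL are all hypothetical and only serve to illustrate the idea. The existence check is passed in as a function so that the logic does not depend on any particular HTTP library.

```python
from typing import Callable, List, Optional


def candidate_prefixes(
    url: str, branch: str, release: str, species: str, build: str, datatype: str
) -> List[str]:
    """Return both filename prefixes: without and with the release number.

    Hypothetical helper; it mirrors the f-string from the wrapper snippet
    above but does not decide between the variants based on the release.
    """
    base = (
        f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/"
        f"{species.capitalize()}"
    )
    return [f"{base}.{build}", f"{base}.{build}.{release}"]


def first_existing(
    prefixes: List[str], suffix: str, exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the first full URL for which exists() is true, else None."""
    for prefix in prefixes:
        full_url = f"{prefix}.{suffix}"
        if exists(full_url):
            return full_url
    return None


# Example with a stand-in existence check (no network access).
# The base URL is a placeholder, not the real Ensembl server address.
prefixes = candidate_prefixes(
    url="https://example.org/pub",
    branch="plants/",
    release="59",
    species="arabidopsis_thaliana",
    build="TAIR10",
    datatype="dna",
)
available = {prefixes[0] + ".dna.toplevel.fa.gz"}
found = first_existing(prefixes, "dna.toplevel.fa.gz", lambda u: u in available)
```

In a real fix, `exists` would be replaced by whatever check the wrapper already performs on the remote file, and the loop would also run over the suffixes it currently tries (`dna.primary_assembly.fa.gz`, `dna.toplevel.fa.gz`).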

lczech, Jul 12 '24 13:07