pyani icon indicating copy to clipboard operation
pyani copied to clipboard

Fetching only n genomes with genbank_get_genomes_by_taxon.py

Open sbridel opened this issue 6 years ago • 5 comments

Summary:

Basically, there is sometimes 100+ genomes fetched from a taxonomic ID (even at species lvl) genbank_get_genomes_by_taxon.py

Is it possible to add an option like: genbank_get_genomes_by_taxon.py -max 5 to fetch only 5 genomes.

And if you enter the taxonomic ID, it will be nice to have the possibility to fetch only 1 or 2 genomes or each species within the genus and exclude / include Genus sp. (unidentified species within the genus)

Simply it's only suggestions.

Pyani is still awesome

sbridel avatar Sep 11 '17 08:09 sbridel

Currently, there is an option --subsample which will randomly choose n genomes from within the supplied set of genomes in the input directory. That will do part of what you want - reduce the number of genomes being used in the analysis.

I'm reluctant to implement exactly what you ask for, as there are more practical issues with following up apparent problems (e.g. misidentified taxa) when downloading random subsets, than when subsampling from a local directory. Also, the issue of misnaming/misclassification makes it difficult to be sure that the subgroup one downloads is representative - more so when few sequences are downloaded - and especially makes it risky to assume that the assigned nomenclature reflects (or does not) the other organisms with that nomenclature, or the clade you want to download.

If I were in the position that I wanted to download only four or five randomly-chosen genomes for a species, I'd do it manually - which is likely as quick as running the script ;)

widdowquinn avatar Sep 26 '17 13:09 widdowquinn

I would like to mention here that sometimes I like to use genbank_get_genomes_by_taxon.py to gather reference genomes for larger taxonomic groups (sometimes up to phylum-level). Especially in cases where clinically relevant species are involved, that can really bloat up the downloaded dataset (can be several thousand genomes, even if there only a few hundred species) due to multiple published genomes of basically identical strains (and sometimes even different versions of the same genome)

In those cases, I usually gather a list of type strains for each species, and automatically filter the result to remove all non-type-strains from the downloaded-dataset, at least for all species of which the type strain was sequenced. In cases of species where only non-type strains have been sequenced (also happens quite often), I select the strain with the largest genome that contains protein annotations with the least number of contigs.

Assuming these infos might also be available directly in entrez (type strain, genome size, number of contigs, number of CDS), maybe this could be incorporated as an option in the script? To me this seems more generally applicable than random subsampling, as this approach favours the most complete representative for each species, and would reduce the amount of unwanted downloads.

jvollme avatar Oct 18 '17 09:10 jvollme

I like the idea of restricting downloads only to those indicated as representative genomes. These should be well-defined in the database and reproducible, and the choice controllable by a single switch (e.g. --reponly). I will have to look at the Entrez/EUtils docs to see what's needed, but it sounds feasible and useful.

I'm less keen on the ad hoc rules for selecting a pseudo-type strain, as even from the short criteria set you mention there's more than one reasonable interpretation (e.g. should we prefer fewer contigs or slightly longer genome?). Restricting to genomes --with-cds, --min-size, and --max-contigsis pretty feasible, but some of the niceties seem like they could be a little troublesome: some submitted sequences might have many short contigs that other submissions discard, and we might only care about contigs over a minimum length (depending on how/when sequenced, etc.) - I'm interested in implementing those options, but they'll likely come in v0.3 (currently classify branch).

A note of caution with type strains: some of the existing NCBI classifications under a single species-level taxID look - to me - to be clearly quite different species. Also, for one species we care a lot about here, the nominated type strain is about the least representative genome available so far, out of around 60 available! ;)

widdowquinn avatar Oct 18 '17 13:10 widdowquinn

Hi, yes I also found that the type strains are not necessarily actually representative for a species. Often they were just the first or most easy to Isolate. It's just that, when you limit a set of reference organisms because you only need (or can only handle) about one representative per species, it's often easiest to argue that you simply selected the type strains. But instead restricting it to the genomes selected at "representative genome" for any species by NCBI seems better (I guess they probably have some quality criteria involved in that selection also).

I agree that selecting "the one" representative for each species is tricky and highly dependent on what your specific goal is. I'd be especially happy about the "--with-cds" and "--max-contigs" options though, because that would help to exclude non-annotated and extremely fragmented draft genomes.

jvollme avatar Oct 20 '17 09:10 jvollme

Specific interface changes - a checklist

  • [ ] add --taxid-file (or similar) to use a predefined set of taxIDs; the format of the associated file should be discussed before implementation
  • [ ] add --type-only option to pyani download so that only Type species (as recorded at NCBI) are downloaded
  • [ ] add --with-cds option to pyani download so that only annotated genomes are downloaded
  • [ ] add --max-contigs option to pyani download so that only genomes with more than a specified number of contigs are downloaded; related question - is it possible to get enough information such that a minimum contig length can be defined when making this count of contig numbers?

widdowquinn avatar May 09 '22 15:05 widdowquinn