
Feature request: allow wildcard filtering based on assembly name

Open jdwinkler-lanzatech opened this issue 1 year ago • 4 comments

Hi,

I was wondering if it would be possible to provide a filtering option based on assembly (species/assigned) name? I often want to pull a group of microbes with a general metabolic capability (say methanogenesis), but currently I have to pick out the TaxIDs manually to do so. Not a major problem, but the feature might be useful for other people too!

jdwinkler-lanzatech avatar Aug 21 '22 23:08 jdwinkler-lanzatech

Hi, thanks for the suggestion. genome_updater selects and filters data based on the assembly_summary.txt file provided by NCBI (more info: https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt). Besides the filter parameters, the -F option allows custom filtering for data selection. However, I'm not sure the information you refer to is contained in that file.

pirovc avatar Aug 22 '22 12:08 pirovc

Column 8 (organism name) would be the target, I think. I believe the -F option is currently an exact match though, so I am thinking of another flag that basically uses grep behind the scenes to implement the matching. I'd basically want to grab all the assemblies with an organism name matching "methano*", if that makes sense. Obviously it would not be perfect, but it could be handy if you have a specific enough search string.

jdwinkler-lanzatech avatar Aug 22 '22 13:08 jdwinkler-lanzatech

Partial matching should be doable, will mark it as enhancement. For now, one can download the full assembly_summary.txt from GenBank or RefSeq, apply the filter/grep manually, and use the resulting file as an external assembly_summary.txt (param. -e).
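
A rough sketch of that manual filtering step in Python, in case it helps. It assumes the file is tab-separated, that header lines start with "#", and that the organism name sits in column 8 as mentioned above. The function name `filter_assembly_summary` and the sample rows are made up for illustration; this is not part of genome_updater.

```python
from fnmatch import fnmatchcase

# 1-based column holding the organism name (assumption taken from the thread).
ORGANISM_NAME_COLUMN = 8


def filter_assembly_summary(lines, pattern, column=ORGANISM_NAME_COLUMN):
    """Keep header lines plus rows whose chosen column matches a glob pattern.

    Matching is case-insensitive and applies to the whole field, so
    "methano*" means "starts with methano" and "*methano*" means "contains".
    """
    pattern = pattern.lower()
    kept = []
    for line in lines:
        if line.startswith("#"):
            # Preserve the header so the result still looks like an assembly summary.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fnmatchcase(fields[column - 1].lower(), pattern):
            kept.append(line)
    return kept


# Made-up placeholder rows, only to show the behaviour (not real NCBI records).
sample = [
    "# header line\n",
    "\t".join(["x"] * 7 + ["Methanococcus example"]) + "\n",
    "\t".join(["x"] * 7 + ["Escherichia example"]) + "\n",
]
filtered = filter_assembly_summary(sample, "methano*")
```

For a real run, read the downloaded assembly_summary.txt line by line, write the returned lines to a new file, and pass that file via -e as described above.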

pirovc avatar Aug 24 '22 11:08 pirovc

Great, thanks! I figure it is a logical addition to the custom filtering offered by -F already.

jdwinkler-lanzatech avatar Aug 24 '22 13:08 jdwinkler-lanzatech