[FEATURE REQUEST] adding contigs database names to deflines of exported genes/proteins fasta

Open dspeth opened this issue 1 year ago • 5 comments

The need

Currently, when exporting gene/protein fasta files from genomes using anvi-get-sequences-for-gene-calls, the identifiers are just numbers, whereas in the GFF export the IDs are in the format <contigs_db_name>___. I would love for these to be consistent, and strongly prefer the second format with the three underscores.

The solution

Either change the default of how locus tags are exported or, perhaps more elegantly, add a --name-in-defline option (or something like that) to anvi-get-sequences-for-gene-calls in non-GFF mode.

Beneficiaries

Those who export sequences from many files, and don't want all of them to start at "0"

dspeth avatar Aug 10 '24 20:08 dspeth

Thanks for this, @dspeth. I think it is time we think about a flexible way to export defline information for all FASTA files. In an ideal world, the user should be able to specify exactly what the deflines of their FASTA file should look like. For instance, if the programs that export FASTA files had a flag like --defline, we could use it the following way:

(...) --defline '{genome_name}_{gene_caller_id} {gene_function}'

And it would give a FASTA file that looks like this:

>HTCC1060_3 COG:XX;KOfam:YY;PFam:ZZ
(...)

Versus,

(...) --defline '{gene_caller_id} {genome_name}'

Would yield something like,

>3 genome_name:HTCC1060
(...)

Versus,

(...) --defline '{gene_caller_id}'

would yield,

>3
(...)
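To make the idea above a bit more concrete, here is a minimal Python sketch of how such a template could be filled in. The helper name `format_defline` and the example values are hypothetical, for illustration only, and are not anvi'o code:

```python
from string import Formatter


def format_defline(template, values):
    """Fill a '{key}'-style defline template from a dict of values.

    Hypothetical helper for illustration; not part of anvi'o.
    """
    # Collect the placeholder names that appear in the template.
    keys = [name for _, name, _, _ in Formatter().parse(template) if name]

    # Fail early, with a readable message, on keys we cannot fill.
    missing = [key for key in keys if key not in values]
    if missing:
        raise KeyError(f"Unknown defline key(s): {', '.join(missing)}")

    # FASTA deflines start with '>'.
    return ">" + template.format(**values)


# Made-up example values, mirroring the example deflines above.
gene = {
    "genome_name": "HTCC1060",
    "gene_caller_id": 3,
    "gene_function": "COG:XX;KOfam:YY;PFam:ZZ",
}

print(format_defline("{genome_name}_{gene_caller_id} {gene_function}", gene))
print(format_defline("{gene_caller_id}", gene))
```

Checking the keys before formatting means a typo in the template produces a clear error instead of a half-formatted defline.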

The technical problem here is that there are many places in the code where deflines are defined on the fly. We could implement a global flag, --list-defline-options, that is caught everywhere the code crafts deflines and reports the 'keys' that can be used in that specific context. The user would then use those keys with the global --defline flag, which would be captured in the same context to override that context's defaults.
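One way to picture this is a small registry of contexts, each with its own keys and default. This is only a sketch under assumptions: the context names, the keys, and the functions `list_defline_options` and `resolve_defline` are all hypothetical and do not describe how anvi'o is actually organized:

```python
from string import Formatter

# Hypothetical registry: the keys each defline-crafting context can offer,
# plus the default template that context uses when the user gives none.
DEFLINE_CONTEXTS = {
    "gene_calls": {
        "keys": ("contigs_db_name", "gene_caller_id"),
        "default": "{gene_caller_id}",
    },
    "genomes": {
        "keys": ("genome_name", "gene_caller_id", "gene_function"),
        "default": "{genome_name}_{gene_caller_id}",
    },
}


def list_defline_options(context):
    """What a --list-defline-options style flag could report for a context."""
    return sorted(DEFLINE_CONTEXTS[context]["keys"])


def resolve_defline(context, user_template=None):
    """Return the user's template if it is valid here, else the default."""
    spec = DEFLINE_CONTEXTS[context]
    if user_template is None:
        return spec["default"]

    # Reject templates that use keys this context does not know about.
    used = {name for _, name, _, _ in Formatter().parse(user_template) if name}
    unknown = used - set(spec["keys"])
    if unknown:
        raise ValueError(
            f"Key(s) not available in '{context}': {', '.join(sorted(unknown))}"
        )
    return user_template


print(list_defline_options("gene_calls"))
print(resolve_defline("gene_calls", "{contigs_db_name}___{gene_caller_id}"))
```

The point of the registry is that every place that writes deflines would validate and list keys the same way, while still keeping its own default.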

I'll try to think about this more once I have time, but if someone wants to take a stab, they should feel free to do so in the meantime :)

meren avatar Aug 13 '24 08:08 meren

Hi Meren, your proposed general solution is much more expansive than what I suggested. If that's doable, I'd of course be happy with that. Otherwise, having an option to treat the defline the same as in the GFF output, so that there's internal consistency within anvi-get-sequences-for-gene-calls, would already be a small step forward.

That said, I get the desire to do this once, and do it right, rather than hacking a patch together. It's also not urgent from my side. So far I've just postprocessed the fasta headers, but that seems like an unnecessary step.
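For anyone needing the same workaround in the meantime, postprocessing of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the exact script used here: the function name `prefix_deflines` is hypothetical, and the genome name and sequences in the example are made up:

```python
def prefix_deflines(lines, contigs_db_name, separator="___"):
    """Prepend a contigs database name to every FASTA defline.

    Hypothetical helper; the three-underscore separator follows the
    GFF-style IDs mentioned earlier in this thread.
    """
    for line in lines:
        if line.startswith(">"):
            # Keep the original identifier after the new prefix.
            yield f">{contigs_db_name}{separator}{line[1:]}"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Made-up records standing in for an exported gene FASTA file.
records = [">0", "ATGAAA", ">1", "ATGCCC"]
for line in prefix_deflines(records, "HTCC1060"):
    print(line)
```

Running this over each exported file keeps genes from different genomes from all starting at "0".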

dspeth avatar Aug 13 '24 13:08 dspeth

This branch is finally merged. Thank you again for your input and guidance, @dspeth.

meren avatar Sep 25 '24 10:09 meren

Hi @meren,

I was running anvi-export-functions, and was wondering whether the "--defline-format F-STRING" could also easily be implemented in that program. While the file generated is not a fasta file, it would still be convenient to have the deflines of the annotation table exported from anvi-export-functions consistent with those from exported fasta files.

dspeth avatar Apr 02 '25 08:04 dspeth

Hey @dspeth,

Very sorry for the late reply. Two weeks ago I was at a conference, and last week I was on a break, so this kept being pushed. I will look into this as soon as I can, but it may still take a few days. I'm sorry and thank you for your patience!

meren avatar Apr 14 '25 11:04 meren