genomad How to export the viral DNA gene sequence?

Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?

Thanks!

Aug 02 '24 06:08 wfgui

There is currently no option to do this, but I could implement it as a feature in the future. In the meantime, you can obtain the nucleotide sequences of the CDSs by extracting them from the genomes using the gene coordinates.

Aug 02 '24 08:08 apcamargo

I also had a seemingly simple question about whether I could format the output at taxonomy, such as converting it to k__; p__; c__; o__; f__; g__; s__.

Thanks!

Aug 14 '24 08:08 wfgui

You can use taxopy for that. geNomad's taxdump is inside the database directory, and you can find the TaxIds in the <prefox>_annotate/<prefox>_taxonomy.tsv file.

For instance:

import taxopy

taxdb = taxopy.TaxDb(
    nodes_dmp="genomad_db/nodes.dmp",
    names_dmp="genomad_db/names.dmp",
    keep_files=True
)
taxon = taxopy.Taxon(5797, taxdb)
for rank, name in reversed(taxon.ranked_name_lineage):
    if name != "root":
        print(f"{rank}__{name}")

realm__Duplodnaviria
kingdom__Heunggongvirae
phylum__Uroviricota
class__Caudoviricetes
order__Crassvirales

Aug 19 '24 20:08 apcamargo

What's the difference between "Unclassified" and "Viruses;;;;;;" ?

Aug 23 '24 05:08 wfgui

"Unclassified" means that the genes in the sequence had no matches to markers with taxonomy information. "Viruses" means that the classification is uncertain at a high rank.

Aug 23 '24 06:08 apcamargo

Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?

Thanks!

I had a very primitive idea. With the help of AI, I modified the prodigal.py file. I didn't carefully examine this code, but perhaps it can offer you some assistance

import multiprocessing.pool
import re
from pathlib import Path

import pyrodigal_gv
from genomad import sequence
_RC_TABLE = str.maketrans(
    "ACGT",   
    "TGCA"      
)

def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a DNA/RNA string (handles IUPAC codes)."""
    return seq.translate(_RC_TABLE)[::-1]



_orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True, mask=True)


def _predict_genes(seq):
    return (seq.accession, _orf_finder.find_genes(seq.seq))


class Prodigal:
    def __init__(self, input_file: Path, prodigal_output: Path) -> None:
        self.input_file = input_file
        self.prodigal_output = prodigal_output

    def run_parallel_prodigal(self, threads: int) -> None:
        contigs = list(sequence.read_fasta(self.input_file))
        dna_lookup = {rec.accession: rec.seq for rec in contigs}

        geneout_path = self.prodigal_output.with_suffix(".fna")    

        with (
            multiprocessing.pool.Pool(threads) as pool,
            open(self.prodigal_output, "w") as faa,   
            open(geneout_path, "w") as fna,           
        ):
            for seq_idx, (seq_acc, predicted_genes) in enumerate(
                pool.imap(_predict_genes, contigs, chunksize=5), 1
            ):
                dna = dna_lookup[seq_acc]             

                for gene_idx, g in enumerate(predicted_genes, 1):
                    header = (
                        f"{seq_acc}_{gene_idx} # {g.begin} # {g.end} # {g.strand} # "
                        f"ID={seq_idx}_{gene_idx};"
                        f"partial={int(g.partial_begin)}{int(g.partial_end)};"
                        f"start_type={g.start_type};rbs_motif={g.rbs_motif};"
                        f"rbs_spacer={g.rbs_spacer};"
                        f"genetic_code={g.translation_table};gc_cont={g.gc_cont:.3f}"
                    )

 
                    nuc_seq = dna[g.begin - 1 : g.end]      # 1-based -> 0-based
                    if g.strand == -1:
                        nuc_seq = reverse_complement(nuc_seq)
                    fna.write(str(sequence.Sequence(header, nuc_seq)))

                    prot_seq = g.translate(include_stop=False)
                    faa.write(str(sequence.Sequence(header, prot_seq)))



    def proteins(self):
        header_parser = re.compile(
            r"(.+)_(.+) # ([0-9]+) # ([0-9]+) # (-1|1) .+rbs_motif=(.+?)"
            r";.+;genetic_code=(.+?);gc_cont=(.+)"
        )
        if not self.prodigal_output.is_file():
            raise FileNotFoundError(f"{self.prodigal_output} was not found.")
        for seq in sequence.read_fasta(self.prodigal_output):
            contig, gene, start, end, strand, rbs, code, gc = header_parser.match(
                seq.header
            ).groups()
            yield (
                contig,
                gene,
                int(start),
                int(end),
                int(strand),
                rbs,
                int(code),
                float(gc),
            )

Jul 03 '25 08:07 JustinRaoV

genomad genomad copied to clipboard

How to export the viral DNA gene sequence?

genomad
genomad copied to clipboard