genomad icon indicating copy to clipboard operation
genomad copied to clipboard

How to export the viral DNA gene sequence?

Open wfgui opened this issue 1 year ago • 5 comments

Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?

Thanks!

wfgui avatar Aug 02 '24 06:08 wfgui

There is currently no option to do this, but I could implement it as a feature in the future. In the meantime, you can obtain the nucleotide sequences of the CDSs by extracting them from the genomes using the gene coordinates.

apcamargo avatar Aug 02 '24 08:08 apcamargo

I also had a seemingly simple question about whether I could format the output at taxonomy, such as converting it to k__; p__; c__; o__; f__; g__; s__.

Thanks!

wfgui avatar Aug 14 '24 08:08 wfgui

You can use taxopy for that. geNomad's taxdump is inside the database directory, and you can find the TaxIds in the <prefox>_annotate/<prefox>_taxonomy.tsv file.

For instance:

import taxopy

taxdb = taxopy.TaxDb(
    nodes_dmp="genomad_db/nodes.dmp",
    names_dmp="genomad_db/names.dmp",
    keep_files=True
)
taxon = taxopy.Taxon(5797, taxdb)
for rank, name in reversed(taxon.ranked_name_lineage):
    if name != "root":
        print(f"{rank}__{name}")
realm__Duplodnaviria
kingdom__Heunggongvirae
phylum__Uroviricota
class__Caudoviricetes
order__Crassvirales

apcamargo avatar Aug 19 '24 20:08 apcamargo

1 What's the difference between "Unclassified" and "Viruses;;;;;;" ?

wfgui avatar Aug 23 '24 05:08 wfgui

"Unclassified" means that the genes in the sequence had no matches to markers with taxonomy information. "Viruses" means that the classification is uncertain at a high rank.

apcamargo avatar Aug 23 '24 06:08 apcamargo

Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?

Thanks!

I had a very primitive idea. With the help of AI, I modified the prodigal.py file. I didn't carefully examine this code, but perhaps it can offer you some assistance

import multiprocessing.pool
import re
from pathlib import Path

import pyrodigal_gv
from genomad import sequence
_RC_TABLE = str.maketrans(
    "ACGT",   
    "TGCA"      
)

def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a DNA/RNA string (handles IUPAC codes)."""
    return seq.translate(_RC_TABLE)[::-1]



_orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True, mask=True)


def _predict_genes(seq):
    return (seq.accession, _orf_finder.find_genes(seq.seq))


class Prodigal:
    def __init__(self, input_file: Path, prodigal_output: Path) -> None:
        self.input_file = input_file
        self.prodigal_output = prodigal_output

    def run_parallel_prodigal(self, threads: int) -> None:
        contigs = list(sequence.read_fasta(self.input_file))
        dna_lookup = {rec.accession: rec.seq for rec in contigs}

        geneout_path = self.prodigal_output.with_suffix(".fna")    

        with (
            multiprocessing.pool.Pool(threads) as pool,
            open(self.prodigal_output, "w") as faa,   
            open(geneout_path, "w") as fna,           
        ):
            for seq_idx, (seq_acc, predicted_genes) in enumerate(
                pool.imap(_predict_genes, contigs, chunksize=5), 1
            ):
                dna = dna_lookup[seq_acc]             

                for gene_idx, g in enumerate(predicted_genes, 1):
                    header = (
                        f"{seq_acc}_{gene_idx} # {g.begin} # {g.end} # {g.strand} # "
                        f"ID={seq_idx}_{gene_idx};"
                        f"partial={int(g.partial_begin)}{int(g.partial_end)};"
                        f"start_type={g.start_type};rbs_motif={g.rbs_motif};"
                        f"rbs_spacer={g.rbs_spacer};"
                        f"genetic_code={g.translation_table};gc_cont={g.gc_cont:.3f}"
                    )

 
                    nuc_seq = dna[g.begin - 1 : g.end]      # 1-based -> 0-based
                    if g.strand == -1:
                        nuc_seq = reverse_complement(nuc_seq)
                    fna.write(str(sequence.Sequence(header, nuc_seq)))

                    prot_seq = g.translate(include_stop=False)
                    faa.write(str(sequence.Sequence(header, prot_seq)))



    def proteins(self):
        header_parser = re.compile(
            r"(.+)_(.+) # ([0-9]+) # ([0-9]+) # (-1|1) .+rbs_motif=(.+?)"
            r";.+;genetic_code=(.+?);gc_cont=(.+)"
        )
        if not self.prodigal_output.is_file():
            raise FileNotFoundError(f"{self.prodigal_output} was not found.")
        for seq in sequence.read_fasta(self.prodigal_output):
            contig, gene, start, end, strand, rbs, code, gc = header_parser.match(
                seq.header
            ).groups()
            yield (
                contig,
                gene,
                int(start),
                int(end),
                int(strand),
                rbs,
                int(code),
                float(gc),
            )

JustinRaoV avatar Jul 03 '25 08:07 JustinRaoV