genomad
genomad copied to clipboard
How to export the viral DNA gene sequence?
Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?
Thanks!
There is currently no option to do this, but I could implement it as a feature in the future. In the meantime, you can obtain the nucleotide sequences of the CDSs by extracting them from the genomes using the gene coordinates.
I also had a seemingly simple question about whether I could format the output at taxonomy, such as converting it to k__; p__; c__; o__; f__; g__; s__.
Thanks!
You can use taxopy for that. geNomad's taxdump is inside the database directory, and you can find the TaxIds in the <prefox>_annotate/<prefox>_taxonomy.tsv file.
For instance:
import taxopy
taxdb = taxopy.TaxDb(
nodes_dmp="genomad_db/nodes.dmp",
names_dmp="genomad_db/names.dmp",
keep_files=True
)
taxon = taxopy.Taxon(5797, taxdb)
for rank, name in reversed(taxon.ranked_name_lineage):
if name != "root":
print(f"{rank}__{name}")
realm__Duplodnaviria
kingdom__Heunggongvirae
phylum__Uroviricota
class__Caudoviricetes
order__Crassvirales
What's the difference between "Unclassified" and "Viruses;;;;;;" ?
"Unclassified" means that the genes in the sequence had no matches to markers with taxonomy information. "Viruses" means that the classification is uncertain at a high rank.
Hi, In the example above I can see proteins FASTA file of GCF_009025895.1_virus_proteins.faa. I want to calculate the gene abundance with virus gene sequence.Can we output the corresponding nucleotide sequence?
Thanks!
I had a very primitive idea. With the help of AI, I modified the prodigal.py file. I didn't carefully examine this code, but perhaps it can offer you some assistance
import multiprocessing.pool
import re
from pathlib import Path
import pyrodigal_gv
from genomad import sequence
_RC_TABLE = str.maketrans(
"ACGT",
"TGCA"
)
def reverse_complement(seq: str) -> str:
"""Return the reverse-complement of a DNA/RNA string (handles IUPAC codes)."""
return seq.translate(_RC_TABLE)[::-1]
_orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True, mask=True)
def _predict_genes(seq):
return (seq.accession, _orf_finder.find_genes(seq.seq))
class Prodigal:
def __init__(self, input_file: Path, prodigal_output: Path) -> None:
self.input_file = input_file
self.prodigal_output = prodigal_output
def run_parallel_prodigal(self, threads: int) -> None:
contigs = list(sequence.read_fasta(self.input_file))
dna_lookup = {rec.accession: rec.seq for rec in contigs}
geneout_path = self.prodigal_output.with_suffix(".fna")
with (
multiprocessing.pool.Pool(threads) as pool,
open(self.prodigal_output, "w") as faa,
open(geneout_path, "w") as fna,
):
for seq_idx, (seq_acc, predicted_genes) in enumerate(
pool.imap(_predict_genes, contigs, chunksize=5), 1
):
dna = dna_lookup[seq_acc]
for gene_idx, g in enumerate(predicted_genes, 1):
header = (
f"{seq_acc}_{gene_idx} # {g.begin} # {g.end} # {g.strand} # "
f"ID={seq_idx}_{gene_idx};"
f"partial={int(g.partial_begin)}{int(g.partial_end)};"
f"start_type={g.start_type};rbs_motif={g.rbs_motif};"
f"rbs_spacer={g.rbs_spacer};"
f"genetic_code={g.translation_table};gc_cont={g.gc_cont:.3f}"
)
nuc_seq = dna[g.begin - 1 : g.end] # 1-based -> 0-based
if g.strand == -1:
nuc_seq = reverse_complement(nuc_seq)
fna.write(str(sequence.Sequence(header, nuc_seq)))
prot_seq = g.translate(include_stop=False)
faa.write(str(sequence.Sequence(header, prot_seq)))
def proteins(self):
header_parser = re.compile(
r"(.+)_(.+) # ([0-9]+) # ([0-9]+) # (-1|1) .+rbs_motif=(.+?)"
r";.+;genetic_code=(.+?);gc_cont=(.+)"
)
if not self.prodigal_output.is_file():
raise FileNotFoundError(f"{self.prodigal_output} was not found.")
for seq in sequence.read_fasta(self.prodigal_output):
contig, gene, start, end, strand, rbs, code, gc = header_parser.match(
seq.header
).groups()
yield (
contig,
gene,
int(start),
int(end),
int(strand),
rbs,
int(code),
float(gc),
)