
custom database creation question

Open Valentin-Bio-zz opened this issue 2 years ago • 1 comments

Hello, I want to build my own database (GTDB + my own MAGs). I used Prodigal to convert my nucleotide FASTA files to protein FASTA files. As far as I can see, Prodigal puts the contig name output by the assembler in the first column of each FASTA header; the columns after it contain Prodigal-specific information. The issue is that to build a custom kaiju database it is necessary to substitute the protein FASTA headers with NCBI protein taxon identifier numbers. Should I do this by writing my own script to assign the NCBI protein taxon identifiers?

Let's say that one protein FASTA header is the following:

k141_811263_4 # 1653 # 1775 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.228

Here k141_811263_4 contains the genome identifier; the "_4" suffix is the contig number within the draft genome.

The genome k141_811263 has previously been classified with GTDB, so taxonomic information is available for it (domain, phylum, class, order, family, genus, species).

So do I have to extract that classification info and match it with the NCBI taxon identifier number?

Valentin-Bio-zz · Dec 13 '21 21:12

Yes, kaiju expects the NCBI taxon identifier in the sequence name. Maybe there is already a mapping somewhere between GTDB taxon names and NCBI taxon IDs that can be used.

pmenzel · Dec 14 '21 08:12
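For readers with the same question, the relabelling step discussed above could be sketched roughly as follows in Python. This is not code from the kaiju project. It assumes that a custom kaiju database accepts headers of the form `>counter_taxid` (a numeric NCBI taxon ID, optionally prefixed by a counter and an underscore), which should be checked against the kaiju README. It also assumes that a genome-to-taxon-ID mapping has already been built from the GTDB classification. The function names and the `genome_to_taxid` dictionary are hypothetical, and the taxon ID in the example is only a placeholder.

```python
from typing import Dict, Iterable, Iterator


def genome_id_from_header(header: str) -> str:
    """Return the part of the first header field before its final '_<number>' suffix."""
    first_field = header.lstrip(">").split()[0]
    genome_id, _, _ = first_field.rpartition("_")
    return genome_id


def relabel_for_kaiju(
    lines: Iterable[str], genome_to_taxid: Dict[str, int]
) -> Iterator[str]:
    """Yield FASTA lines with headers rewritten as '>counter_taxid'.

    Records whose genome identifier is missing from genome_to_taxid
    (a hypothetical, user-supplied mapping) are skipped entirely.
    """
    counter = 0
    keep = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            taxid = genome_to_taxid.get(genome_id_from_header(line))
            keep = taxid is not None
            if keep:
                counter += 1
                yield f">{counter}_{taxid}"
        elif keep and line:
            yield line


example = [
    ">k141_811263_4 # 1653 # 1775 # -1 # ID=1_4;partial=00;start_type=ATG",
    "MKKLLNI",
    ">unmapped_contig_1 # 1 # 90 # 1 # ID=2_1",
    "MSTA",
]
# Placeholder mapping: 2 is the NCBI taxon ID for Bacteria; replace it with
# the ID that matches the genome's actual GTDB classification.
relabelled = list(relabel_for_kaiju(example, {"k141_811263": 2}))
print("\n".join(relabelled))
```

Passing an open file handle instead of the `example` list processes a whole Prodigal protein FASTA file line by line. Building `genome_to_taxid` from GTDB names is the harder part and is left out here, since it depends on which GTDB-to-NCBI mapping is used.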