kaiju
custom database creation question
Hello, I want to build my own database (GTDB + my own MAGs). I used Prodigal to convert my nucleotide FASTA files to protein FASTA files. As far as I can see, Prodigal puts the contig name output by the assembler in the first column of each FASTA header, followed by Prodigal's own information about the predicted gene. The issue is that, to build a custom Kaiju database, it is necessary to substitute the protein FASTA headers with NCBI taxon identifier numbers. Should I do this by writing my own script to assign the NCBI taxon identifiers?
Let's say one protein FASTA header is the following:
k141_811263_4 # 1653 # 1775 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.228
Here k141_811263 corresponds to the genome identifier, and the "_4" suffix to the contig number of the draft genome.
The genome k141_811263 has previously been classified with GTDB, so there is taxonomic information about it (domain, phylum, class, order, family, genus, species).
So do I have to extract that classification info and match it with the NCBI taxon identifier number?
Yes, Kaiju expects the NCBI taxon identifier in the sequence name. Maybe there is already a mapping somewhere between GTDB taxon names and NCBI taxon IDs that can be used.
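For the renaming step itself, a minimal Python sketch could look like the following. It assumes you already have a dictionary mapping each Prodigal sequence prefix (here `k141_811263`) to a numeric NCBI taxon ID; building that mapping from the GTDB classification is not covered here. The function names `contig_of` and `relabel` are hypothetical, and the taxon ID used in the usage note is only a placeholder. The output header format `>counter_taxid` follows the Kaiju README's description of custom database headers (a numeric taxon ID, optionally prefixed by another identifier and an underscore), so please check it against the Kaiju version you use.

```python
from typing import Dict, Iterable, Iterator


def contig_of(header: str) -> str:
    """Return the name Prodigal was given, without its trailing "_<n>" suffix."""
    # The first whitespace-delimited token is e.g. "k141_811263_4".
    seq_id = header.lstrip(">").split()[0]
    # Drop the last "_<n>" that Prodigal appends to the input sequence name.
    return seq_id.rsplit("_", 1)[0]


def relabel(lines: Iterable[str], name_to_taxid: Dict[str, int]) -> Iterator[str]:
    """Rewrite FASTA headers as ">counter_taxid"; sequence lines pass through."""
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            # Raises KeyError if a name has no taxon ID in the mapping (assumption:
            # failing loudly is preferable to writing an unlabelled sequence).
            taxid = name_to_taxid[contig_of(line)]
            yield f">{counter}_{taxid}\n"
        else:
            yield line
```

Usage would be something like `out.writelines(relabel(open("proteins.faa"), {"k141_811263": 1358}))`, where `proteins.faa` is a hypothetical file name and `1358` stands in for whatever taxon ID your genome actually maps to.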