conterminator
Fails to predict inserted contamination
Hello!
I built a synthetic genome to check the outputs, and conterminator failed to predict the inserted contaminants.
Info: Version: 1.c74b5. Organisms in this synthetic genome: Saccharum hybrid cultivar SP80-3280, Klebsiella pneumoniae and Acinetobacter baumannii.
History: I inserted the complete A. baumannii and K. pneumoniae genomes into the sugarcane genome and created a Kraken mapping file (when I checked the mapping file, I could see the taxonomy IDs of the inserted sequences: A. baumannii = 470, K. pneumoniae = 573 and SP80-3280 = 193079).
Then I ran conterminator with the following command:
conterminator dna synthetic_genome.fasta kraken_mapping_file.txt synthetic_genome_conterminator tmp
Results: The synthetic_genome_conterminator_conterm_prediction file is empty. The synthetic_genome_conterminator_all file has no information about the inserted contaminants.
Data: synthetic_genome_conterminator_all.txt, kraken_mapping_file.txt. The genome file is too big to attach, and the conterm_prediction file is empty.
Problem: My objective is to detect contamination in the sugarcane genome. Am I using conterminator incorrectly, or is conterminator failing to predict contamination?
We currently predict contamination only for short sequences of length < 20 kb. The 20 kb can be in scaffolds or just single sequences. I assume you have just one long sequence?
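A quick way to check whether the input falls outside that limit is to list the sequence lengths in the FASTA file. The following is only a rough sketch: the function names and the 20,000 default are illustrative and are not part of conterminator. It also counts total sequence length only, so it does not account for the per-contig length between flanking Ns that the tool apparently uses for scaffolds.

```python
import io


def fasta_lengths(handle):
    """Return {sequence name: total length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def over_threshold(lengths, threshold=20000):
    """Keep only sequences at or above the (assumed) 20 kb cutoff."""
    return {n: l for n, l in lengths.items() if l >= threshold}


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("synthetic_genome.fasta").
    demo = io.StringIO(">chr1 demo\n" + "ACGT" * 6000 + "\n>short1\nACGTACGT\n")
    print(over_threshold(fasta_lengths(demo)))
```

Any sequence reported here would, by the description above, not appear in the conterm_prediction output, even if it contains contaminant alignments.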
@martin-steinegger Is there a way to indicate that contamination should be reported for longer sequences? I'm trying to reproduce the example between C. elegans and E. coli in your ms.
The _all report should contain all the local alignments with cross-kingdom hits (--kingdom). This could be used to filter for longer sequences. Can you find the C. elegans and E. coli hits in it? The format is as follows:
1.) Numeric identifier
2.) Sequence identifier
3.) Alignment start
4.) Alignment end
5.) Corrected contig length (length between flanking Ns)
6.) Total sequence length
7.) Kingdom (default: 0: Bacteria&Archaea, 1: Fungi, 2: Metazoa, 3: Viridiplantae, 4: Other Eukaryotes)
8.) Species name
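Filtering the _all report for hits on longer sequences could look roughly like the sketch below. It assumes the report is tab-separated with one hit per line and exactly the eight fields listed above in that order; the actual file layout may differ, so check a few lines first. The names `Hit`, `parse_all_report` and `hits_on_long_contigs` are made up for this example.

```python
import csv
import io
from collections import namedtuple

# Field order follows the eight columns described in this thread.
Hit = namedtuple(
    "Hit", "num_id seq_id start end contig_len seq_len kingdom species"
)


def parse_all_report(handle):
    """Yield one Hit per line, assuming tab-separated columns."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8:
            continue  # skip blank or malformed lines
        yield Hit(
            row[0], row[1],
            int(row[2]), int(row[3]),  # alignment start / end
            int(row[4]), int(row[5]),  # corrected contig length / total length
            row[6], row[7],
        )


def hits_on_long_contigs(hits, min_len=20000):
    """Keep hits whose corrected contig length is at least min_len."""
    return [h for h in hits if h.contig_len >= min_len]


if __name__ == "__main__":
    # Made-up rows for illustration; replace with
    # open("synthetic_genome_conterminator_all.txt").
    demo = io.StringIO(
        "0\tseqA\t100\t900\t50000\t60000\t0\tEscherichia coli\n"
        "1\tseqB\t10\t500\t8000\t8000\t2\tCaenorhabditis elegans\n"
    )
    for hit in hits_on_long_contigs(parse_all_report(demo)):
        print(hit.seq_id, hit.start, hit.end, hit.species)
```

The filter uses column 5 (corrected contig length) rather than column 6, on the assumption that this is the length the 20 kb criterion is applied to.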
There are indeed expected hits in the _all file. Is it possible to make the 20 kb filtering criterion an exposed parameter? This would also help document to users that such a criterion exists.
Yes, I agree. I had this on my todo list for quite some time. :( But currently I am quite flooded with work.