
fail to predict inserted contamination

Open felipevzps opened this issue 4 years ago • 5 comments

Hello!

I built a synthetic genome to check the outputs, and conterminator failed to predict the inserted contaminants.

Info: Version: 1.c74b5. Organisms in this synthetic genome: Saccharum hybrid cultivar SP80-3280, Klebsiella pneumoniae, and Acinetobacter baumannii.

History: I inserted the complete A. baumannii and K. pneumoniae genomes into the sugarcane genome and created a Kraken mapping file. When I checked the mapping file, I could see the taxonomy IDs of the inserted organisms (A. baumannii = 470, K. pneumoniae = 573, SP80-3280 = 193079).

Then, I ran the conterminator with the following command: conterminator dna synthetic_genome.fasta kraken_mapping_file.txt synthetic_genome_conterminator tmp

Results: The synthetic_genome_conterminator_conterm_prediction file is empty. The synthetic_genome_conterminator_all file does not contain any information about the inserted contaminants.

Data: synthetic_genome_conterminator_all.txt, kraken_mapping_file.txt. The genome file is too big to attach, and the conterm_prediction file is empty.

Problem: My objective is to observe contamination in the sugarcane genome. Am I using conterminator incorrectly, or is conterminator failing to predict contamination?

felipevzps avatar Jul 31 '20 09:07 felipevzps

We currently predict contamination only for short sequences of length < 20 kb. The 20 kb can be in scaffolds or just single sequences. I assume you have just one long sequence?
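To check whether an input is affected by the 20 kb limit, one option is to list the length of every record in the FASTA file. The following is only a rough sketch and not part of conterminator: the function names are hypothetical, and treating "in scaffolds" as "stretches between runs of N" is an assumption based on the "length between flanking Ns" column of the _all report.

```python
import re

THRESHOLD = 20_000  # the 20 kb cutoff mentioned in this thread


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def segment_lengths(seq):
    """Lengths of the stretches between runs of N (assumed contig definition)."""
    return [len(s) for s in re.split(r"[Nn]+", seq) if s]


def summarize(lines, threshold=THRESHOLD):
    """Return (name, total_length, n_segments, n_segments_below_threshold) per record."""
    rows = []
    for name, seq in read_fasta(lines):
        segs = segment_lengths(seq)
        below = sum(1 for s in segs if s < threshold)
        rows.append((name, len(seq), len(segs), below))
    return rows
```

Calling `summarize(open("synthetic_genome.fasta"))` would return one tuple per record. A record whose last value is 0 has no segment under 20 kb, so, if the assumption above holds, no prediction would be expected for it.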

martin-steinegger avatar Aug 01 '20 05:08 martin-steinegger

@martin-steinegger Is there a way to indicate that contamination should be reported for longer sequences? I'm trying to reproduce the example between C. elegans and E. coli in your ms.

donovan-h-parks avatar Nov 09 '21 00:11 donovan-h-parks

The _all report should contain all the local alignments with cross kingdom hits (--kingdom). This could be used to filter for longer sequences. Can you find the C.elegans and E.coli in it? The format is like the following:

1.) Numeric identifier
2.) Sequence identifier
3.) Alignment start
4.) Alignment end
5.) Corrected contig length (length between flanking Ns)
6.) Total sequence length
7.) Kingdom (default: 0: Bacteria&Archaea, 1: Fungi, 2: Metazoa, 3: Viridiplantae, 4: Other Eukaryotes)
8.) Species name 
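
A possible way to filter the _all report for longer sequences is sketched below. It assumes the file is tab-separated with exactly the eight columns listed above and in that order; the delimiter, the `Hit` tuple, and the function names are assumptions made for illustration and are not part of conterminator.

```python
from collections import namedtuple

# Field names follow the eight columns described above.
Hit = namedtuple(
    "Hit", "num_id seq_id start end contig_len total_len kingdom species"
)


def parse_all_report(lines):
    """Parse lines of an _all report (assumed tab-separated) into Hit tuples."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or unexpected lines
        try:
            hits.append(
                Hit(
                    fields[0],
                    fields[1],
                    int(fields[2]),
                    int(fields[3]),
                    int(fields[4]),
                    int(fields[5]),
                    fields[6],  # kept as text; 0-4 by default per the list above
                    fields[7],
                )
            )
        except ValueError:
            continue  # skip lines whose numeric columns do not parse
    return hits


def long_sequence_hits(hits, min_total_len=20_000):
    """Keep hits located on sequences of at least min_total_len bases."""
    return [h for h in hits if h.total_len >= min_total_len]


def kingdoms_per_sequence(hits):
    """Map each sequence identifier to the set of kingdoms reported for it."""
    result = {}
    for h in hits:
        result.setdefault(h.seq_id, set()).add(h.kingdom)
    return result
```

For example, `long_sequence_hits(parse_all_report(open("synthetic_genome_conterminator_all.txt")))` would keep only the alignments on sequences of 20 kb or more, which is the group for which no prediction is currently written.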

martin-steinegger avatar Nov 09 '21 09:11 martin-steinegger

There are indeed expected hits in the _all file. Is it possible to make the 20 kb filtering criterion an exposed parameter? This would also help document to users that such a criterion exists.

donovan-h-parks avatar Nov 09 '21 15:11 donovan-h-parks

Yes, I agree. I had this on my todo list for quite some time. :( But currently I am quite flooded with work.

martin-steinegger avatar Nov 09 '21 15:11 martin-steinegger