circlator
circlator copied to clipboard
fixstart: no suitable promer matches found (due to deletion in dnaA sequence of assembler output)
I've hit what's probably a rare corner case. Circularization completed successfully, but the start position of the chromosome was not set properly. I checked the logs for fixstart and this is what I see:
==> 06.fixstart.detailed.log <==
No suitable promer matches found
Using the following prodigal predictions to circularize contigs:
scf7180000000004|quiver Prodigal_v2.6.1 CDS 2205287 2207290 249.2 + 0 ID=1_2144;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.678;conf=99.99;score=249.24;cscore=242.85;sscore=6.39;rscore=1.96;uscore=0.46;tscore=3.96;
==> 06.fixstart.log <==
[fixstart] id break_point gene_name gene_reversed new_name skipped
[fixstart] scf7180000000004|quiver 2205287 prodigal no - -
When I BLAST my organism's reference dnaA sequence against my assembly, I notice that my assembly's version is one base shorter. I suspect that there's a frameshift deletion in the assembly that mangles the translated protein sequence, preventing promer
from making any matches. The deletion may be false and buff out during assembly polishing, but it still caused trouble here.
Thanks, yes we know that this can happen. I could add in an option to not require a full length match to the reference gene? I'm thinking require the start of the gene to match, as I'd still want to have the assembly start fixed to be the methionine at the start of the gene. But relax the requirement to have a promer match all the way to the end of the gene.
Would that be helpful to you?
I'm not sure it would. My 06.fixstart.promer.promer
look like this:
PROMER
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [% SIM] [% STP] [LEN R] [LEN Q] [FRM] [TAGS]
3294129 3294689 1524 964 561 561 98.93 98.93 1.07 4414018 1524 3 -1 scf7180000000004|quiver gi|448814763:1-1524
3294130 3294681 1523 972 552 552 100.00 100.00 2.17 4414018 1524 1 -2 scf7180000000004|quiver gi|448814763:1-1524
3294131 3294694 1522 959 564 564 98.40 99.47 5.32 4414018 1524 2 -3 scf7180000000004|quiver gi|448814763:1-1524
3294675 3295649 977 3 975 975 99.69 99.69 2.46 4414018 1524 3 -2 scf7180000000004|quiver gi|448814763:1-1524
3294676 3295650 976 2 975 975 99.69 100.00 7.08 4414018 1524 1 -3 scf7180000000004|quiver gi|448814763:1-1524
3294679 3294131 974 1522 549 549 100.00 100.00 2.19 4414018 1524 -1 2 scf7180000000004|quiver gi|448814763:1-1524
3294680 3295651 972 1 972 972 100.00 100.00 1.23 4414018 1524 2 -1 scf7180000000004|quiver gi|448814763:1-1524
3294680 3294129 973 1524 552 552 100.00 100.00 0.54 4414018 1524 -3 1 scf7180000000004|quiver gi|448814763:1-1524
3294759 3294130 888 1523 630 636 89.15 91.51 3.54 4414018 1524 -2 3 scf7180000000004|quiver gi|448814763:1-1524
3295649 3294678 3 974 972 972 100.00 100.00 4.94 4414018 1524 -3 3 scf7180000000004|quiver gi|448814763:1-1524
3295650 3294676 2 976 975 975 99.69 99.69 1.85 4414018 1524 -2 2 scf7180000000004|quiver gi|448814763:1-1524
3295651 3294677 1 975 975 975 100.00 100.00 0.00 4414018 1524 -1 1 scf7180000000004|quiver gi|448814763:1-1524
I think it would be difficult to get a good cutoff. Also, there would still be the problem of a frameshift close to the beginning of the gene that would almost entirely change the amino acid sequence.
I already fixed this particular case using my pre-circlator script, which used blastn to find the start position and sequence orientation. A nucleotide search (rather than an amino acid search) makes more sense to me, but this would significantly change your algorithm. Then the sequence position that corresponds to the first base of the start codon (even if it's a mismatch) could be set to the start position.