motifBreakR
motifBreakR copied to clipboard
Odd Motif Positions
tl;dr: I find many motifPos that imply TF binds completely outside the variant range (start AND end are negative), but also it starts after it ends (end < start)? Also, motif length != end-start.
I got this result from running motifBreakeR on a list of variants that are 12bp substitution (a bit odd, maybe why this happens):
chr8 48582443 48582454 12 + mutant CTGTTGTTATTC TCACCACCGCCT Other c(-6, -9) MSANTD3 jaspar2022 MA1523.1 MA1523.1 gctctccacTCACCACCGCCTtggtcccaa 0.6356006 0.9892791 4.870919 7.465608 NA NA 1:12 2.5946893 0.34392895 strong
Note motifPos: c(-6,9)
I think start is correct (in this and similar cases), but end should simply be start+motifLen.
It seems the bug lies here:
if ((mresult$varType == "Insertion" & score < 0) |
(mresult$varType == "Deletion" & score > 0)) {
motif.end <- motif.start + len
} else {
if (motif.start > 0) {
motif.end <- len - length(motif.start:length(alt_loc[1]:alt_loc[2]))
} else {
motif.end <- motif.start + len - length(alt_loc[1]:alt_loc[2])
}
}
I'm not completely sure what is calculated under else
, but it results in this weird bug. First case (motif.start+len
) would work well in here. I can't think of a case when, after establishing where motif starts in reference, the end wouldn't be motif.start+len. In any case may be worth throwing a warning if start>=end.
Now I'm not sure start is calculated correctly either.
E.g. Zfp410 motif below allegedly starts on the base before the variant and ends on position 15 (-1,15). This would make alt seq aTGGCCGAGATGGgat. But 1) motif length is 17bp, not 16bp and 2) best match for Zfp410 to alt is (5,21) CGAGATGGgatgatagg, which perfectly matches UniPROBE top k-mer ATGGGATG. So something is right, but something is wrong.
seqnames start end width strand SNP_id REF ALT varType motifPos geneSymbol
chr8 48582575 48582586 12 + LC14t17 CAATTAGAGCAA TGGCCGAGATGG Other -1, 15 Zfp410
dataSource providerName providerId
UniPROBE SCI09/Zfp410_pwm_primary.txt UP00033
seqMatch pctRef pctAlt scoreRef
gaatctggaaaTGGCCGAGATGGgatgataggc 0.5466476 0.9525102 6.257263
scoreAlt Refpvalue Altpvalue altPos alleleDiff alleleEffectSize effect
10.66372 NA NA 6 4.406453 0.3941613 strong
Firstly, yes you have exposed a bug, particularly on the Zfp410 example.
Your first example, however, I believe is working as intended.
I think I have it confusingly labeled, however because of the conditions that I am trying to describe, I'm not sure if I could do it differently.
The first number is the number of bases upstream (negative) or downstream (positive) of the start of the variant describing where the motif starts. The second number is the number of bases upstream or downstream of the end of the variant, describing where the motif ends.
In this figure, the motif is represented by the pink rectangle, and the insertion by the green rectangle, and the motif position is recorded to the right.
In your first example c(-6, -9)
the motif is six bp upstream from the start of the 12mer, and nine bp upstream from the end of the 12mer.
In your second example the two 12mers have a 3bp area of similarity and when I try to align the sequences to find out exactly where the motif is breaking the reference sequence, the alignment is produced like this:
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: TGGCC-GAGATGG
subject: CAATTAGAG-CAA
score: -69.24901
a deletion at 6bp and an insertion at 10bp, rather than failing as intended and marking all 12bp a substitution.
For this reason, its marking the position relative to this false 1bp deletion, rather than the 12bp substitution.
Sorry for the difficulty, I'll try get this sorted out ASAP.
Thanks for clarifying and sorry for bundling two issues in one! I see how this notation makes sense wrt indels. The figure you've attached does a great job of explaining that, could be useful having it in the vignette.
So it seems this is still not working as of motifbreakR_2.13.8. Note: same motif, different mutations (adjacent), same ref motif score and pctRef detected, but first one is 12bp long (7,6) and second is 23bp? (-10,1)
I've made a reprex, attached.