Example Primer Development Workflow

This gets the ball rolling on an example primer workflow that generates a large number of DNA primers satisfying the following:
- [x] consistent GC content
- [x] roughly the same melting temperature
- [ ] don't dimerize with themselves or other primers in the set
- [x] no shared subsequences >4bp with any other primer in the set
- [ ] do not bind to any DNA sequence from a reference set (e.g. the FreeGenes library and the genomes of E. coli, B. subtilis, S. cerevisiae, and P. pastoris)
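
To make the GC-content and shared-subsequence items concrete, here is a minimal Python sketch. It is not Poly's Go implementation; the function names `gc_percent`, `kmers`, and `shares_subsequence` are hypothetical, and "subsequence" is assumed to mean a contiguous run of bases. Two primers share a run longer than 4 bp exactly when they have a 5-mer in common, so the check reduces to a set intersection.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def kmers(seq: str, k: int) -> set:
    """All contiguous substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_subsequence(a: str, b: str, max_shared: int = 4) -> bool:
    """True if a and b share a contiguous run longer than max_shared bases."""
    k = max_shared + 1
    return not kmers(a, k).isdisjoint(kmers(b, k))
```

A candidate would be kept only if `gc_percent` falls inside the chosen window and `shares_subsequence` is false against every primer already accepted.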
regarding: Dimerization
I'm not sure what the condition for "no dimerization" ought to be. If it is only that no two sequences are reverse complements of each other, then this condition is already satisfied by the "no shared subsequences >4bp" check.
To accomplish this in a way that is meaningful for practicing biologists, this probably needs:
(a) some function that describes the binding energy of two primers,
(b) CreateBarcodesWithBannedSequences to check new primers against the primers it has already generated in the ongoing function call (the current bannedFunctions arg only considers the new primer in isolation, not in relation to the existing set),
(c) a "bannedFunction" that sets the threshold for what probability/frequency of dimerization is acceptable.
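
Points (a) to (c) could fit together roughly as in the Python sketch below. Everything in it is an assumption for illustration: it is not how CreateBarcodesWithBannedSequences works today, the names `longest_complementary_run` and `select_compatible` are hypothetical, and the longest perfectly complementary stretch is only a crude stand-in for a real binding-energy function.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_complementary_run(a, b):
    """Longest stretch of a that can base-pair perfectly with b.

    Computed as the longest common substring of a and revcomp(b).
    Crude proxy for (a); a real version would return a binding energy.
    """
    a, target = a.upper(), reverse_complement(b)
    best = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if a[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def select_compatible(candidates, pair_banned):
    """Keep candidates that pass pair_banned against themselves and
    against every primer accepted so far, which is point (b)."""
    accepted = []
    for cand in candidates:
        if pair_banned(cand, cand):  # self-dimer
            continue
        if any(pair_banned(cand, kept) for kept in accepted):
            continue
        accepted.append(cand)
    return accepted


# Point (c): the threshold lives in the pairwise banned function.
# The cutoff of 5 bases is an arbitrary placeholder, not a recommendation.
def too_sticky(a, b):
    return longest_complementary_run(a, b) >= 5
```

With this shape, swapping in an energy model only means replacing `too_sticky`; the set-aware loop stays the same.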
regarding: Doesn't Bind to Reference Sequences
Hopefully this is a straightforward iteration over the genomes, but there is definitely room for a better (i.e. faster) method, since iterating over multiple Mbp-scale genomes could easily take a while.
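
One possible faster approach, sketched in Python under assumptions: index every k-mer of the reference sequences once, then look up each primer's k-mers in that index instead of rescanning the genomes per primer. The names `build_kmer_index` and `reference_hits` are hypothetical, only the given strand is indexed (a real scan would presumably also cover the reverse complement), and the memory cost of such an index for full genomes is not addressed here.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to the (reference name, position) pairs where it occurs.

    references: dict of name -> sequence string.
    """
    index = defaultdict(list)
    for name, seq in references.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def reference_hits(primer, index, k):
    """Return (reference name, reference position, primer offset) for every
    k-mer of the primer that occurs in the indexed references."""
    primer = primer.upper()
    hits = []
    for offset in range(len(primer) - k + 1):
        for name, pos in index.get(primer[offset:offset + k], []):
            hits.append((name, pos, offset))
    return hits
```

The index is built once per reference set, so the cost per primer depends on the primer length and the number of hits rather than on genome size.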
Thanks for getting the ball rolling @codercahol!
@eyesmo since you requested this can you help review it?
Excellent. Preliminary thoughts:
- Set target_temp to 60.0
- Set target_GC to 50.0, with a margin of 10.0
- I would think there should be a way to use the LinearFold algorithm @v-raja built into Poly (I believe @isaacguerreir also has experience working with this) to calculate how energetically favorable hairpins, homo- and hetero-dimers are within the primer set
- For scanning genomes for matching sequences, it might be good to figure out how much different combinations of individual base mismatches destabilize the LinearFold-predicted dimerization energy, up to a given threshold (say, a 10 °C drop in predicted melting temperature, relative to a perfect match). In other words, how many mismatches across the whole primer sequence are required to destabilize the duplex enough that it wouldn't anneal significantly at the primer's designed melting temperature? Then:
  - Search the 'background' sequence space for perfect substring matches of length L (where L is probably >10).
  - Also search the 'background' for substring matches of length K (K<L, e.g. K=5 or 6), and pull the local 'background' sequence (e.g. 15 bases on either side of the match) around each K-mer match.
  - With the pulled sequences, do one of two things:
    1. Run a sequence similarity check on the pulled sequences, counting up mismatches and indels relative to the primer that matched the K-subsequence, and eliminating/mutating the primer if there are too few mismatches (i.e. below the threshold LinearFold generally predicts would destabilize the duplex and reduce the melting temperature by 10 °C).
    2. Just directly run LinearFold on the suspect primer paired with every local 'background' sequence surrounding a K-mer match, eliminating or mutating any primers predicted to bind one of these sequences with a melting temperature within 10 °C of the target melting temperature.
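
The K-mer seed plus similarity check (option 1) might look roughly like this Python sketch. It is a simplification under stated assumptions: it counts mismatches only (no indels), it aligns the primer ungapped along the seed's diagonal rather than pulling a fixed 15-base flank, it looks at one strand only, and the function names are hypothetical. The mismatch threshold itself would have to come from the LinearFold calibration described above and is not included.

```python
def seed_positions(primer, background, k):
    """Yield (primer offset, background position) for each exact k-mer match."""
    for offset in range(len(primer) - k + 1):
        seed = primer[offset:offset + k]
        start = background.find(seed)
        while start != -1:
            yield offset, start
            start = background.find(seed, start + 1)


def mismatches_at(primer, background, align_start):
    """Mismatch count with the primer aligned at background[align_start:],
    or None if the primer would overhang either end of the background."""
    if align_start < 0 or align_start + len(primer) > len(background):
        return None
    window = background[align_start:align_start + len(primer)]
    return sum(p != b for p, b in zip(primer, window))


def min_mismatches(primer, background, k):
    """Fewest mismatches over all seed-anchored alignments, or None if
    the primer shares no k-mer with the background."""
    primer, background = primer.upper(), background.upper()
    best = None
    for offset, pos in seed_positions(primer, background, k):
        count = mismatches_at(primer, background, pos - offset)
        if count is not None and (best is None or count < best):
            best = count
    return best
```

A primer whose `min_mismatches` falls below the calibrated threshold would then be eliminated or mutated; option 2 would replace the mismatch count with a LinearFold call on the same seed-anchored windows.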