find_differential_primers Use read mapper as alternative to `PrimerSearch`

Use read mapper as alternative to `PrimerSearch`

Open widdowquinn opened this issue 5 years ago • 8 comments

It should be possible to speed up the in silico hybridisation step by using bwa or similar to map candidate primers. The output of this step could be made to resemble PrimerSearch output to minimise the extra effort required to process the result.

Nov 29 '18 13:11 widdowquinn

Cross reference https://github.com/widdowquinn/find_differential_primers/wiki/Emulating-PrimerSearch-using-Bowtie2

Sep 30 '21 15:09 peterjc

I've been using Jim Kent's stand alone isPcr tool for in silico hybridisation, available in bioconda:

https://anaconda.org/bioconda/ispcr

It does seem consistently much faster, but how much would depend on the typical FASTA file size used (first example here is 4.9MB, one bacterial chromosome), number of primers (here 141), and how strict you want this to be:

$ time primersearch -seqall references/NC_004547.2.fa -infile candidates_v1.tsv -mismatchpercent 10 -outfile /dev/stdout \
| grep "Amplimer length" -c
Search DNA sequences for matches with primer pairs
146

real	0m33.866s
user	0m33.359s
sys	0m0.233s

This drops to 141 results with a mismatch percentage of 9 or lower. The tool is a lot faster with -mismatchpercent 0,

$ time primersearch -seqall references/NC_004547.2.fa -infile candidates_v1.tsv -mismatchpercent 1 -outfile /dev/stdout \
| grep "Amplimer length" -c
Search DNA sequences for matches with primer pairs
141

real	0m16.334s
user	0m15.855s
sys	0m0.226s

However isPcr still wins:

$ time isPcr references/NC_004547.2.fa candidates_v1.tsv  /dev/stdout | grep -c "^>"
141

real	0m0.177s
user	0m0.103s
sys	0m0.021s

Timing best of 5 on the same machine.

The default settings in isPcr may be too strict for FDP's usage? It has no equivalent mismatch percentage setting, but can be made less strict.

Sep 30 '21 16:09 peterjc

Thanks @peterjc - that's a considerable speed-up, and worth considering as a drop-in replacement for PrimerSearch - it could have a significant effect on runtime.

A quick Google suggests it's not maintained:

exes: https://genome-test.gi.ucsc.edu/~kent/exe/
source: https://genome-test.gi.ucsc.edu/~kent/src/

and the documentation is the usage string. There doesn't appear to be an M1-compiled version yet, which is a potential question for future application/testing (but may be inconsequential - I don't have an M1 to test compilation on).

We're not treating PrimerSearch as a gold standard, so I don't think it's essential that the results are identical to ispcr (but if they were very different there would be other questions). So long as they each capture plausible cross-hybridisation their performance should be good enough to generate candidates for in vitro validation.

Sep 30 '21 17:09 widdowquinn

I recently emailed Jim Kent about a trivial bug (one of the documented options is not implemented), partly to see if he replies. Putting this in a positive light, isPcr is a mature stable tool ;)

I would expect the bioconda community to tackle M1 compilation, but have not looked into that.

Right now I'm running a larger test case - a much bigger genome and lots more primers. Here isPrc took 3s while primersearch is taking many minutes (bad enough that I had to submit it as a cluster job since the interactive session timed out). This demonstrates to my satisfaction that the timings above this isn't just a flat startup overhead.

Sep 30 '21 17:09 peterjc

Large genome example, 173 contigs in 116MB FASTA file, with 2634 candidate primers (only 2300 unique), simple test script:

#!/bin/bash
echo ispcr
for i in 1 2 3 4 5; do
    time isPcr reference.fasta candidates_v1.tsv benchmark.ispcr.txt
done
grep -c "^>" benchmark.ispcr.txt
echo primersearch
time primersearch -seqall reference.fasta -infile candidates_v1.tsv \
     -mismatchpercent 0 -outfile benchmark.ps.txt
grep "Amplimer length" -c benchmark.ps.txt

Best of 5 with ispcr:

Loaded 119585199 letters in 173 sequences

real	0m2.678s
user	0m2.342s
sys	0m0.162s
4389

Single test with primersearch:

Search DNA sequences for matches with primer pairs

real	91m36.578s
user	91m0.619s
sys	0m22.544s
5448

I had run this with primersearch on the same cluster (but likely a different node) and that run also took about 1h30. Note also this is with -mismatchpercent 0 which earlier testing suggested is faster as well as stricter.

So, ispcr did not return as many hits, but did so in less than 1/1000 of the time. While in the bacteria test, the time taken wasn't quite as dramatic needing approx 1/100 the time.

There are subtleties in the different output, e.g. if the genome contains ...F...F....R.... then it appears primersearch returns only the smaller amplicon F....R while ispcr returns F...F....R as well.

i.e. Switching in-silico PCR tool could be a non-trivial behavioural change, but worth considering if this is a computational bottleneck.

Sep 30 '21 22:09 peterjc

I recently emailed Jim Kent about a trivial bug (one of the documented options is not implemented), partly to see if he replies. Putting this in a positive light, isPcr is a mature stable tool ;)

😁 That is true!

I would expect the bioconda community to tackle M1 compilation, but have not looked into that.

I didn't go so far as to inspect the recipe to see if there's a (re-)compilation, or if the executable is pulled down as-is.

Oct 01 '21 08:10 widdowquinn

Large genome example, 173 contigs in 116MB FASTA file, with 2634 candidate primers (only 2300 unique), simple test script: Best of 5 with ispcr:

real	0m2.678s
user	0m2.342s
sys	0m0.162s

Single test with primersearch:

real	91m36.578s
user	91m0.619s
sys	0m22.544s
5448

That is a striking speed increase!

Note also this is with -mismatchpercent 0 which earlier testing suggested is faster as well as stricter.

I have no problem with having ispcr as an option (maybe along with other tools/approaches, too). The speed increase alone will make some jobs tractable - especially on limited hardware - that would not otherwise be accessible.

So, ispcr did not return as many hits, but did so in less than 1/1000 of the time. While in the bacteria test, the time taken wasn't quite as dramatic needing approx 1/100 the time.

In my experience the problem we've had - even with PrimerSearch looking for the potential to amplify some sites with "wobble" in the match - is that we under-predict the extent of cross-amplification. The balance to hit here is between speed of computation and waste of time/material in testing candidates primer sets that could be excluded with less effort. Generally, computation is cheap and you can get on with other things while the code runs. Lab staff and consumables are more expensive in terms of time and cash.

With that in mind a mismatch of zero is, more or less, equivalent to a grep for the primer sequences. We would naturally expect this to be computationally efficient. To the extent that it always finds cross-hybridisation with exact matches, it can rule out some primer sets. But it is likely to let through a number of sets that amplify despite having numerous mismatches ("false positives", in the sense of a "positive" being a discriminatory primer set).

The computational challenge is more difficult as we try to more closely approximate the in vitro behaviour of the primers, and reduce our "false positives" closer to zero.

I'd like the balance to be in the hands of the user so, as I say, I've no objection to having several options for the cross-hybridisation detection, including ispcr (with zero mismatches if the user wants it), but I felt I should make the case why faster isn't necessarily better.

I imagine that a deep-learning classifier, trained on a sufficient number of in vitro examples would do a much better job than either tool… I wonder if anyone would fund us to do that? ;)

There are subtleties in the different output, e.g. if the genome contains ...F...F....R.... then it appears primersearch returns only the smaller amplicon F....R while ispcr returns F...F....R as well.

It should be possible to restrict the maximum amplicon length with ispcr. When doing metabarcoding there's a natural restriction on the expected amplicon size that conveniently lets us set a hard limit.

i.e. Switching in-silico PCR tool could be a non-trivial behavioural change, but worth considering if this is a computational bottleneck.

I've come round to the view that giving the user the option to use their choice of tool and prioritise either speed or accuracy according to their means and needs is a better option than switching. But ispcr would clearly massively speed up the process and, if you've a robotised lab, you might not mind testing all the candidates a slower tool would rule out.

Oct 01 '21 08:10 widdowquinn

Indeed. And isPcr does have a maximum size setting (there is a documented but not implemented minimum size too). It has some fuzziness settings but not a percentage mismatch like primersearch.

Oct 01 '21 08:10 peterjc

find_differential_primers find_differential_primers copied to clipboard

Use read mapper as alternative to `PrimerSearch`

find_differential_primers
find_differential_primers copied to clipboard