
Implement and evaluate a new long-read-specialized tandem repeat finder

Open • kvg opened this issue 4 years ago • 1 comment

Implement a WDL task for the Noise-Cancelling Repeat Finder (NCRF) and evaluate its use. From the manuscript (https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz484/5530597):

Abstract

Summary: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.

Availability and implementation: NCRF is implemented in C, supported by several python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.

kvg • Sep 14 '19 01:09

I looked into this, and I'm not convinced NCRF is meant to be used the way we're expecting. It takes a FASTA file as input and, per the abstract, searches for tandem repeats of motifs specified up front, so it's not clear to me how to run it as a discovery tool. In fact, from the documentation:

Note that this is not intended to be a turnkey solution but more of an exploratory platform for the user. It will likely require parameter tweaking and experimentation on the part of the user, as well as some user-written programs to post-process the output.
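To make the motif-driven versus discovery distinction concrete, here is a minimal Python sketch of what a search for a specified motif looks like. It is not NCRF's algorithm: it only finds exact, forward-strand tandem runs and tolerates none of the noise NCRF is designed to handle. The names `parse_fasta`, `find_exact_tandem_repeats`, and `RepeatHit` are hypothetical, introduced only for illustration.

```python
import re
from typing import Iterator, NamedTuple, Tuple


class RepeatHit(NamedTuple):
    read_name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    copies: int  # number of consecutive exact motif copies


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def find_exact_tandem_repeats(
    fasta_text: str, motif: str, min_copies: int = 3
) -> Iterator[RepeatHit]:
    """Report runs of at least `min_copies` exact copies of `motif`.

    The motif must be supplied by the caller; nothing here discovers
    new motifs, which is the limitation discussed in this thread.
    """
    motif = motif.upper()
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_copies},}}")
    for name, seq in parse_fasta(fasta_text):
        for match in pattern.finditer(seq):
            copies = (match.end() - match.start()) // len(motif)
            yield RepeatHit(name, match.start(), match.end(), copies)
```

Calling `find_exact_tandem_repeats(fasta_text, "AATGG")` only ever reports arrays of the motif it was given. A caller for repeat expansions or contractions at unknown loci would need a different approach, or a curated motif list plus post-processing.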

I don't think this is what we want. We're going to need to look more broadly at specialized repeat expansion/contraction callers for long-read data.

kvg • Oct 12 '20 15:10