SimSeq icon indicating copy to clipboard operation
SimSeq copied to clipboard

Overestimated error rate

Open ghost opened this issue 13 years ago • 4 comments
trafficstars

Hello, I have now proceeded to the synthetic read generation itself. However, when checking my alignment with 'samtools tview', the reads present approximately 30 mismatches each, for a total of 76 bp length. Once these synthetic reads are converted to FASTQ format and then realigned using BWA, they mostly end up as unaligned records. I have tried tweaking the parameters of SimSeq, so far to no avail. The original BAM file is free of duplicates and has a much lower error rate, although perhaps some regions have a high error rate due to structural events.

In order to generate the alignment I did the following: -estimate the error profile for an existing BAM file using getErrorProfile -generate a modified sequence according to my own needs -feed that sequence and the error profile into SimSeq -add headers to the generated SAM file -converting the SAM to BAM -sort the BAM -use tview to examine the alignment against the modified sequence

I also went through the ultimate steps once (SamToFastq then BWA).

The command-line I used for SimSeq is as follows: java -jar -Xmx2048m SimSeq.jar -1 76 -2 76 -e ep.txt -l 5500 -m -n 64000000 -o synth.sam --phred64 -r modified.fa --mate_pulldown_error_p 0.7

I tried several values for the mate pulldown error, but it didn't help. My guess is that the values in ep.txt are overestimated for some reason. Is there any way to either control how they are used by the program, or to lower them in a rational way to a more adequate number? Also, you mentioned something about the values being single-position error probabilities and not conditional ones. Perhaps this could be the cause?

Thanks in advance,

A./

ghost avatar Jul 06 '12 15:07 ghost

The conditional thing wouldn't cause the issue. You are building the error profile based on reads that align so the resulting error rate shouldn't be any worse than the original error rate. The conditional problem is just that when errors show up, it is in unrealistic ways. For example, in real reads if you have an error and bad phred score at the N-2 position of a read, it is likely that the N-1 position has the same problems. Errors and bad phred scores tend to either cluster at the ends of reads, or not be there at all. Either the entire last part of a read looks like junk, or it looks pretty good, you don't usually see junky bases mixed in with good bases. Right now those two positions are treated independently of each other although they are conditioned based on the position itself. For example if you have a 10% chance of error at position X given phred score Y, then that is what you should see averaged across all instances of that position and phred score in your simulated data, but the next position doesn't take the previous into account when choosing it's phred score and corresponding probability of error.

Could you set up a small test for me to check out?

On Jul 6, 2012, at 8:39 AM, de-a wrote:

Hello, I have now proceeded to the synthetic read generation itself. However, when checking my alignment with 'samtools tview', the reads present approximately 30 mismatches each, for a total of 76 bp length. Once these synthetic reads are converted to FASTQ format and then realigned using BWA, they mostly end up as unaligned records. I have tried tweaking the parameters of SimSeq, so far to no avail. The original BAM file is free of duplicates and has a much lower error rate, although perhaps some regions have a high error rate due to structural events.

In order to generate the alignment I did the following: -estimate the error profile for an existing BAM file using getErrorProfile -generate a modified sequence according to my own needs -feed that sequence and the error profile into SimSeq -add headers to the generated SAM file -converting the SAM to BAM -sort the BAM -use tview to examine the alignment against the modified sequence

I also went through the ultimate steps once (SamToFastq then BWA).

The command-line I used for SimSeq is as follows: java -jar -Xmx2048m SimSeq.jar -1 76 -2 76 -e ep.txt -l 5500 -m -n 64000000 -o synth.sam --phred64 -r modified.fa --mate_pulldown_error_p 0.7

I tried several values for the mate pulldown error, but it didn't help. My guess is that the values in ep.txt are overestimated for some reason. Is there any way to either control how they are used by the program, or to lower them in a rational way to a more adequate number? Also, you mentioned something about the values being single-position error probabilities and not conditional ones. Perhaps this could be the cause?

Thanks in advance,

A./


Reply to this email directly or view it on GitHub: https://github.com/jstjohn/SimSeq/issues/6

jstjohn avatar Jul 06 '12 16:07 jstjohn

Hello, here is the download link http://www.2shared.com/file/QDIgNBI5/exampletar.html

The tar file features a part of the original BAM alignment (1Mb on chr1), the error profile generated from it (.ep), the reference (.fa) and finally, a sorted & headed BAM file built from the SimSeq SAM output (synth.*.bam). The file is 110M total.

ghost avatar Jul 10 '12 13:07 ghost

Hello, a colleague of mine hinted that perhaps there is some misinterpretation of the quality scores going on in the first step (i.e. Sequencing Qality/Illumina 1.3/Illumina 1.5). Could this be the case? Tell me if you need me to provide you with some test material.

A./

ghost avatar Jul 17 '12 09:07 ghost

Unfortunately I am not going to have time to debug this. However I have uploaded the previous version to the Downloads page for you to try as well.

jstjohn avatar Oct 11 '12 14:10 jstjohn