
Understand above_error_rate and why it's always False for insertion

Open npatel-ah opened this issue 4 months ago • 10 comments

Hello,

Thank you for such a great pipeline; it's been really helpful for our analysis.

We've run a few hundred samples through different versions of the pipeline. When we reviewed the results, we found that for all of the samples, "above_error_rate" is always False for insertions (ins). Is this expected? We would appreciate a brief description of what "above_error_rate" stands for.

Regards

npatel-ah avatar Aug 26 '25 15:08 npatel-ah

Hello @npatel-ah,

I am glad the pipeline has been useful for you. "above_error_rate" means that the frequency of this insertion is higher than the sequencing error rate. We calculate this using the Phred scores from the fastq files, which encode the probability of a base call being an error. The Phred score (Q) follows this formula:

Q = -10 * log10(P), equivalently P = 10^(-Q/10)

We calculate P (the probability of error) for each base of the insertion, and then apply a filter: the insertion passes when the proportion of reads carrying it (relative to the total number of reads) is higher than the mean P across the insertion bases: num insertion reads / num total reads > mean(P)
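The check can be sketched in Python. This is a simplified illustration under the stated assumptions (Phred+33 encoded quality strings); the function names are hypothetical and the actual crisprseq implementation may differ:

```python
# Sketch of the above_error_rate filter, assuming Phred+33 quality encoding.
# Hypothetical helper names; not the pipeline's actual code.

def phred_to_error_prob(qual_char: str) -> float:
    """Convert one Phred+33 quality character to P(base call is an error)."""
    q = ord(qual_char) - 33      # Phred score Q
    return 10 ** (-q / 10)       # P = 10^(-Q/10)

def above_error_rate(insertion_quals: str,
                     n_insertion_reads: int,
                     n_total_reads: int) -> bool:
    """True when the insertion frequency exceeds the mean per-base error probability."""
    probs = [phred_to_error_prob(c) for c in insertion_quals]
    mean_p = sum(probs) / len(probs)
    return n_insertion_reads / n_total_reads > mean_p
```

For example, a 3-bp insertion whose bases all have quality "I" (Q=40, so P=1e-4 each), seen in 5 of 10,000 reads, gives 5e-4 > 1e-4 and passes the filter; the same insertion with quality "!" (Q=0, P=1) would not.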

Could you check the percentage of each insertion and the quality of those reads? It would help us rule out a bug in the pipeline.

Thanks!

mirpedrol avatar Aug 27 '25 09:08 mirpedrol

Thank you for the detailed explanation.

"Could you revise the percentage of each insertion and the quality of those reads?"

Do you mean updating the threshold? I don't see a specific option for it. Regarding read quality, perhaps we can trim low-quality read ends. I can try doing this.

npatel-ah avatar Aug 28 '25 16:08 npatel-ah

Sorry, I meant checking if the insertions you see happen in a high number of reads. You should be able to see this in the output files *_Insertions.html.

mirpedrol avatar Sep 01 '25 06:09 mirpedrol

Thanks again. I notice that most of the insertions occur near the beginning of the reads, away from the cut sites. How can this be addressed? See the example below; I see the same pattern consistently across most of the samples.

Image

npatel-ah avatar Sep 03 '25 13:09 npatel-ah

Good catch :) This is usually due to sequencing errors at the beginning of the reads, so it makes sense that your indels are classified as above_error_rate False. You can trim some bases from the start of your reads, or trim by quality, with either cutadapt or seqtk. In the pipeline we use cutadapt to trim adapters; we don't trim any extra bases. We then use seqtk to mask low-quality bases, with the following arguments: -q 20 -L 80 -n N

  • bases with quality < 20 are masked with an N
  • reads shorter than 80 bases are discarded
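The masking step above can be sketched in Python. This is a simplified stand-in for what `seqtk seq -q 20 -L 80 -n N` does to each read (the pipeline calls seqtk itself, not this code, and `mask_low_quality` is a hypothetical name):

```python
# Simplified Python equivalent of the seqtk masking described above,
# assuming Phred+33 quality encoding. Illustrative only.

def mask_low_quality(seq: str, quals: str, min_q: int = 20, min_len: int = 80):
    """Mask bases with quality < min_q as 'N'; discard reads shorter than min_len.

    Returns the masked sequence, or None when the read is discarded.
    """
    if len(seq) < min_len:
        return None                                  # -L 80: read discarded
    masked = [
        base if (ord(q) - 33) >= min_q else "N"      # -q 20 -n N: mask low quality
        for base, q in zip(seq, quals)
    ]
    return "".join(masked)
```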

You can override this by providing a custom config file; see the nf-core documentation. For example, you could try removing the first 5 bp of your reads:

process {
    withName: CUTADAPT_FIVE_PRIME {
        ext.args = "-u 5 -g ${params.five_prime_adapter} ${params.extra_cutadapt_five_prime_args ?: ''}"
    }
}

The -u 5 is the addition; the rest of the arguments are the ones already used in the pipeline.

Another option would be to lower the quality filter with seqtk (currently set to 20), depending on what you see in your reads.

I also copy links to the seqtk documentation and the [cutadapt documentation](https://cutadapt.readthedocs.io/en/stable/guide.html#cut-bases) for reference.

Could you please let me know if this sounds good to you and if it works? We don't have a specific step to trim bases in the pipeline, but it could be a good option to add one if this is a common issue.

mirpedrol avatar Sep 03 '25 14:09 mirpedrol

Thank you for the quick response! I think using a custom config file should be straightforward. I'll give that a try and report back.

Also, for my reference sequence, I included the primer sequences on both ends. I’m not sure if that could be affecting the results, but I can also try removing them and rerunning the analysis if that’s recommended.

npatel-ah avatar Sep 03 '25 16:09 npatel-ah

Hello again, I re-ran the analysis trimming 5–10 bases, but it didn't improve the results. Looking further into the data, I noticed a position with both a 1-base G deletion and a G insertion. Could this be related to merging or mapping artifacts?

Insertion

Image

Deletion

Image

npatel-ah avatar Sep 04 '25 15:09 npatel-ah

Here is one more example from a completely different experiment. As you can see, the insertion rate is very low, yet even then NONE of the insertions were identified as "above_error_rate". Would you be able to check the code for a possible bug?

Image

npatel-ah avatar Sep 04 '25 16:09 npatel-ah

Are you able to share the file for one of the samples so it is easier for me to try it out and debug? From these plots I would say your sample is not edited, since you don't have any editing around the cut site, and all the insertions are located at the ends of the reads, which are usually sequencing artifacts. Is this what you expect from this sample?

mirpedrol avatar Sep 08 '25 09:09 mirpedrol

If you prefer, you can contact me on the nf-core slack and we can talk there

mirpedrol avatar Sep 08 '25 09:09 mirpedrol