methylseq icon indicating copy to clipboard operation
methylseq copied to clipboard

Complexity curve

Open bazyliszek opened this issue 5 years ago • 5 comments

  1. Looking into complexity curve. Could it be that for PE reads we actually need to add the -P parameter in preseq? This was not detected in the pipeline automatically.
samtools sort 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.bam \
         -m 8589934592 \
         -@ 1 \
         -o 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam
     preseq lc_extrap -v -B 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam -o 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.ccurve.txt 
  1. Also, for very small amounts of the reads this program is not working, but I guess that is my problem with reads of total number 119,573. Error: max count before zero is less than min required cound (4), sample not sufficiently deep or duplicates removed.

original command:

BAM_INPUT
TOTAL READS     = 119573
DISTINCT READS  = 119456
DISTINCT COUNTS = 3
MAX COUNT       = 3
COUNTS OF 1     = 119342
MAX TERMS       = 2
OBSERVED COUNTS (4)
1       119342
2       111
3       3

ERROR:  max count before zero is les than min required count (4), sample not sufficiently deep or duplicates removed

and if I implement PE:

PAIRED_END_BAM_INPUT
paired = 119572
unpaired = 0
MERGED PAIRED END READS = 119572
MATES PROCESSED = 239144
TOTAL READS     = 119572
DISTINCT READS  = 119485
DISTINCT COUNTS = 2
MAX COUNT       = 2
COUNTS OF 1     = 119398
MAX TERMS       = 2
OBSERVED COUNTS (3)
1       119398
2       87

ERROR:  max count before zero is les than min required count (4), sample not sufficiently deep or duplicates removed

bazyliszek avatar May 21 '19 09:05 bazyliszek

Ooh, yes - this goes a long way back. Basically, running with -P makes preseq fail a lot. This became very frustrating so I just removed it and made it run in single-end mode all of the time. The shapes of the curves are still correct, but I agree that it's not very clear and should be improved.

The problem with it failing with low read counts is just down to preseq. Nothing that we can do about that from the pipeline I think sorry.

ewels avatar May 22 '19 14:05 ewels

Ok, great! Thanks! I will not use -P for now.

bazyliszek avatar May 28 '19 07:05 bazyliszek

I've run into a similar issue and managed to fix it. The problem is that preseq requires a BED file as input. See here in the bedtools on how to support paired-end files.

# sort BAM file
samtools sort -O BAM 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.bam > 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam
# index  BAM file
samtools index 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bam
# convert to BED file with paired-ends (BEDPE format)
bamToBed -i 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted -bedpe >  6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bed
# run in paired-end mode with -P
preseq lc_extrap -v -P 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.bed -o 6-10A_S13_L001_R1_001_val_1_bismark_bt2_pe.sorted.ccurve.txt

Possibly addresses issue #161. Note that I'm using preseq version 3.0.2 installed from GitHub smithlabcode/preseq. It's possible that newer versions (since this repo uses version 2.0.3) address these problems.

TomKellyGenetics avatar Oct 27 '20 05:10 TomKellyGenetics

Just want to add that this issue has not been resolved yet. I am using 3.1 version and still facing the same issue when trying to run Preseq on Iseq run. I know the number of reads sequenced on Iseq are low and so will be the mapping reads in bam file. But it should run without error irrespective of that or fail quietly without disrupting nextflow pipeline.

Rohit-Satyam avatar Jan 10 '23 07:01 Rohit-Satyam

I am also facing the same issue even with bed file. Both bam and bed gives the same issue.

Nitin123-4 avatar Mar 03 '23 22:03 Nitin123-4