Purecn 2.2.0 Coverage.R issues
Hi,
I am trying to run Coverage.R from the latest Purecn docker. The command that I am running is
Rscript $PURECN/Coverage.R --out-dir normaldb/ --bam bam.list --intervals xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c.hg38.bed.gz --cores 30
The error I got is below:
INFO [2022-07-13 03:51:54] Loading PureCN 2.2.0...
INFO [2022-07-13 03:51:54] Using BiocParallel MulticoreParam backend with 30 cores.
INFO [2022-07-13 03:51:54] Processing /mnt/SCRATCH/purecn/purecn_coverage/1c854b02-e522-4db8-b223-a4232639a8e7_wxs_gdc_realn_coverage.txt.gz...
INFO [2022-07-13 03:51:55] Processing /mnt/SCRATCH/purecn/purecn_coverage/7a310be2-73f3-49db-b1c1-7b887fbc11d4_wxs_gdc_realn_coverage.txt.gz...
INFO [2022-07-13 03:51:55] Processing /mnt/SCRATCH/purecn/purecn_coverage/8a4d7bd5-3943-4121-a566-ec33c99f5a3a_wxs_gdc_realn_coverage.txt.gz...
INFO [2022-07-13 03:51:54] Processing /mnt/SCRATCH/purecn/purecn_coverage/11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn_coverage.txt.gz...
Stop worker failed with the error: wrong args for environment subassignment
Error: BiocParallel errors
2 remote errors, element index: 1, 2
9 unevaluated and other errors
first remote error:
Error in asMethod(object): The character vector to convert to a GRanges object must contain
strings of the form "chr:start-end" or "chr:start-end:strand", with end
>= start - 1, or "chr:pos" or "chr:pos:strand". For example:
"chr1:2501-2900", "chr1:2501-2900:+", or "chr1:740". Note that ".." is
a valid alternate start/end separator. Strand can be "+", "-", "*", or
missing.
Execution halted
It might suggest that my baits interval has invalid format against Purecn. Here is the format from my bait file.
chr1 69069 69189 469_378975_79501(OR4F5)_1_1 0 +
If it is the issue of format, I'm wondering if we have some documents on the format of intervals, and if it's not, do you have insight what might cause the issue?
After the failure, I tried to run GATK4 DepthOfCoverage and then convert the output by Coverage.R. I am able to run the GATK4 and get the sample_interval_summary. However, I'm unable to convert it by the Coverage.R.
Here is the command I used:
Rscript $PURECN/Coverage.R --out-dir purecn_coverage/ --coverage 11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn_depofcov/11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn_depofcov.sample_interval_summary --intervals xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c.hg38.bed.gz
The error I got is:
INFO [2022-07-11 13:46:42] Loading PureCN 2.2.0...
WARN [2022-07-11 13:46:55] Coverage data contains single nucleotide intervals.
WARN [2022-07-11 13:46:55] Found 22910 overlapping intervals, starting at line 38.
FATAL [2022-07-11 13:46:57] Provided coverage is zero, most likely due to a corrupt BAM file.
FATAL [2022-07-11 13:46:57]
FATAL [2022-07-11 13:46:57] This is most likely a user error due to invalid input data or
FATAL [2022-07-11 13:46:57] parameters (PureCN 2.2.0).
Error: Provided coverage is zero, most likely due to a corrupt BAM file.
This is most likely a user error due to invalid input data or
parameters (PureCN 2.2.0).
In addition: Warning message:
In max(raw$average.coverage[raw$on.target], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Execution halted
Not sure what I did wrong for the GATK4 part, but for your information, the GATK4 command that I ran is:
gatk DepthOfCoverage -R GRCh38.d1.vd1.fa -O 11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn_depofcov/11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn_depofcov -I 11748118-4168-4894-b7ff-b9106ec56614_wxs_gdc_realn.bam -L xgen-exome-research-panel-probesbe255a1532796e2eaa53ff00001c1b3c.hg38.bed.gz
Please let me know if you need additional information. Thanks!
Sorry for the late response. I would recommend using the baits BED file, not the BED file with the targeted exons. The baits BED should be more even in width and should not contain single basepair intervals (usually all should be 80-120bp or multiples of it).
If you use GATK4, make sure the intervals are not changed (i.e. merged). There is an option for that.
Thanks for your reply. For the baits BED file, normally what we get are hg19 bed files, and we do liftover to hg38. So even if it was a bait file with larger intervals, liftover will generate some smaller, even single base intervals, and they will be merged after liftover. It looks like the blocker is that we have uneven/single bp intervals. What would you recommend to process the intervals:
- Remove some short intervals. We can remove interval of 1 base, or say intervals less than 5 bases, etc. Is there any suggestions on the cutoff?
- Add padding to all intervals, given that most of the bed files we get are capture region file instead of bait file.
For the GATK4, I believe I've been using the same interval BED file. Just wondering which option you were referring to. From the GATK or Purecn?
Just want to update that I fixed issue by re-run the intervalFile.R.