IntervalFile.R command line behaviour different than preprocessIntervals()
Describe the issue
Case 1: IntervalFile.R does not have an equivalent for the off.target.padding argument in preprocessIntervals()
Case 2: preprocessIntervals() does not keep the Gene column from a GRanges object specified in interval.file.
To Reproduce
Case 2:
bed.file <- system.file("extdata", "ex2_intervals.bed",
package = "PureCN", mustWork = TRUE)
reference.file <- system.file("extdata", "ex2_reference.fa",
package = "PureCN", mustWork = TRUE)
intervals <- import(bed.file)
#dummy gene names
intervals$Gene <- letters[1:5]
preprocessIntervals(interval.file = intervals,
reference.file = reference.file,
output.file = "gc_file.txt"
)
gc_file.txt contents, have a point where the gene name should be:
@HD VN:1.6
@SQ SN:seq1 LN:800
@SQ SN:seq2 LN:800
Target gc_bias mappability reptiming Gene on_target
seq1:101-250 0.453333333333333 NA NA . TRUE
seq1:301-650 0.505714285714286 NA NA . TRUE
seq2:1-150 0.573333333333333 NA NA . TRUE
seq2:401-550 0.48 NA NA . TRUE
seq2:676-775 0.41 NA NA . TRUE
Expected behavior
Case 2: In addition to the above, the command line IntervalFile.R automatically adds annotation, I would like to use custom gene names per bait.
Session Info
R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_CH.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] PureCN_2.2.0 VariantAnnotation_1.42.1 Rsamtools_2.12.0 Biostrings_2.64.0
[5] XVector_0.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[9] GenomeInfoDb_1.32.2 IRanges_2.30.0 S4Vectors_0.34.0 MatrixGenerics_1.8.0
[13] matrixStats_0.62.0 BiocGenerics_0.42.0 DNAcopy_1.70.0
loaded via a namespace (and not attached):
[1] httr_1.4.3 VGAM_1.1-7 bit64_4.0.5 splines_4.2.1 assertthat_0.2.1
[6] BiocFileCache_2.4.0 blob_1.2.3 BSgenome_1.64.0 GenomeInfoDbData_1.2.8 yaml_2.3.5
[11] progress_1.2.2 pillar_1.7.0 RSQLite_2.2.14 lattice_0.20-45 glue_1.6.2
[16] digest_0.6.29 RColorBrewer_1.1-3 colorspace_2.0-3 Matrix_1.5-1 XML_3.99-0.9
[21] pkgconfig_2.0.3 biomaRt_2.52.0 zlibbioc_1.42.0 purrr_0.3.4 scales_1.2.0
[26] BiocParallel_1.30.2 tibble_3.1.7 KEGGREST_1.36.0 ggplot2_3.3.6 generics_0.1.2
[31] ellipsis_0.3.2 cachem_1.0.6 GenomicFeatures_1.48.1 cli_3.3.0 mclust_5.4.10
[36] magrittr_2.0.3 crayon_1.5.1 memoise_2.0.1 fansi_1.0.3 xml2_1.3.3
[41] tools_4.2.1 data.table_1.14.2 prettyunits_1.1.1 hms_1.1.1 formatR_1.12
[46] BiocIO_1.6.0 lifecycle_1.0.1 stringr_1.4.0 Rhdf5lib_1.18.2 munsell_0.5.0
[51] DelayedArray_0.22.0 lambda.r_1.2.4 AnnotationDbi_1.58.0 compiler_4.2.1 rlang_1.0.2
[56] rhdf5_2.40.0 futile.logger_1.4.3 grid_4.2.1 RCurl_1.98-1.6 rhdf5filters_1.8.0
[61] rstudioapi_0.13 rjson_0.2.21 rappdirs_0.3.3 bitops_1.0-7 gtable_0.3.0
[66] restfulr_0.0.13 DBI_1.1.2 curl_4.3.2 R6_2.5.1 gridExtra_2.3
[71] GenomicAlignments_1.32.0 dplyr_1.0.9 rtracklayer_1.56.0 fastmap_1.1.0 bit_4.0.4
[76] utf8_1.2.2 filelock_1.0.2 futile.options_1.0.1 stringi_1.7.6 parallel_4.2.1
[81] Rcpp_1.0.8.3 vctrs_0.4.1 png_0.1-7 dbplyr_2.1.1 tidyselect_1.1.2
Hi Stephany,
-
that's right, looks like the -500 off-target padding default is used. Out of curiosity, any reason you want to change this? I remember picking that cutoff after looking at our 3000X data where the coverage left and right of a bait dropped to 0 long before reaching 500. So that's the gap between baits and off-target region that is ignored to avoid on-target reads making it into off-target regions. If there is a real need to change -500, I can add an option for that.
-
is a fair feature request and I'm surprised it took so long until it made it here :-) out of curiosity, how different is the automatic annotation compared to yours? So unfortunately for now, you would need to manually change the annotation.
I don't recommend using preprocessIntervals over IntervalFile.R. The latter makes a lot more sanity checks etc.
Markus
Hi Markus, thanks for replying :)
-
I noticed this mainly because of the R package vignette, it says: "Some users got acceptable results with small (<1Mb) panels. Try to find a perfect off-target bin width (average.off.target.width in preprocessIntervals) and maximize the number of heterozygous SNPs by including as much padding as possible". So I thought I should try to change any type of padding, on and off target, although I think here you meant the padding in Mutect. Now that you explain what this padding actually is, maybe it doesn't make sense to change.
-
The automatic annotation is very close to mine actually, but I was thinking that I don't want re-annotate everytime I run IntervalFile.R, since I already know what they should be. And because the db packages you fetch to annotate might get updated or something, and then I might get a different annotation (not likely, but who knows).
Stephany
Yes, I recommend using the baits (not the exons, the actual capture baits) BED file as you got it from the vendor in IntervalFile.R. Then run Mutect with interval padding. 50 is the default which is fine. But with high coverage, you should get a few more heterozygous SNPs by increasing it further. Much beyond your read length doesn't make a lot sense, since there won't be coverage anyways. If you increase it beyond 50, make sure to let PureCN.R know by increasing --padding.
IntervalFile.R should warn you if the off-target width isn't optimal.
does this "off target" cover the entire genome outside the baits + padding? I have very high coverage, so I get plenty of off target mapping, I was thinking that if I call SNVs in the entire genome then I can also feed this into PureCN.R? but then what would I tell the padding? or does this not make sense.
If you provide the proper baits file and use a 100-200bp padding in your variant calling, you shouldn't get much that's worth adding. It shouldn't hurt too much, but the VCFs become huge and most will be filtered out. I think you need to explicitly say that SNPs in off-target regions are kept. I don't do it in our assays.
Off-target is anything between baits +/- 500bp padding reaching the minimum size. It's used to calculate the tumor vs normal coverage to get a more accurate ploidy.