PureCN IntervalFile.R command line behaviour different than preprocessIntervals()

Describe the issue

Case 1: IntervalFile.R does not have an equivalent for the off.target.padding argument in preprocessIntervals() Case 2: preprocessIntervals() does not keep the Gene column from a GRanges object specified in interval.file.

To Reproduce

Case 2:

bed.file <- system.file("extdata", "ex2_intervals.bed",
                        package = "PureCN", mustWork = TRUE)
reference.file <- system.file("extdata", "ex2_reference.fa",
                              package = "PureCN", mustWork = TRUE)
intervals <- import(bed.file)

#dummy gene names
intervals$Gene <- letters[1:5]

preprocessIntervals(interval.file = intervals,
                    reference.file = reference.file,
                    output.file = "gc_file.txt"
)

gc_file.txt contents, have a point where the gene name should be:

@HD	VN:1.6
@SQ	SN:seq1	LN:800
@SQ	SN:seq2	LN:800
Target	gc_bias	mappability	reptiming	Gene	on_target
seq1:101-250	0.453333333333333	NA	NA	.	TRUE
seq1:301-650	0.505714285714286	NA	NA	.	TRUE
seq2:1-150	0.573333333333333	NA	NA	.	TRUE
seq2:401-550	0.48	NA	NA	.	TRUE
seq2:676-775	0.41	NA	NA	.	TRUE

Expected behavior

Case 2: In addition to the above, the command line IntervalFile.R automatically adds annotation, I would like to use custom gene names per bait.

Session Info

R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_CH.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] PureCN_2.2.0 VariantAnnotation_1.42.1 Rsamtools_2.12.0 Biostrings_2.64.0
[5] XVector_0.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[9] GenomeInfoDb_1.32.2 IRanges_2.30.0 S4Vectors_0.34.0 MatrixGenerics_1.8.0
[13] matrixStats_0.62.0 BiocGenerics_0.42.0 DNAcopy_1.70.0

loaded via a namespace (and not attached): [1] httr_1.4.3 VGAM_1.1-7 bit64_4.0.5 splines_4.2.1 assertthat_0.2.1
[6] BiocFileCache_2.4.0 blob_1.2.3 BSgenome_1.64.0 GenomeInfoDbData_1.2.8 yaml_2.3.5
[11] progress_1.2.2 pillar_1.7.0 RSQLite_2.2.14 lattice_0.20-45 glue_1.6.2
[16] digest_0.6.29 RColorBrewer_1.1-3 colorspace_2.0-3 Matrix_1.5-1 XML_3.99-0.9
[21] pkgconfig_2.0.3 biomaRt_2.52.0 zlibbioc_1.42.0 purrr_0.3.4 scales_1.2.0
[26] BiocParallel_1.30.2 tibble_3.1.7 KEGGREST_1.36.0 ggplot2_3.3.6 generics_0.1.2
[31] ellipsis_0.3.2 cachem_1.0.6 GenomicFeatures_1.48.1 cli_3.3.0 mclust_5.4.10
[36] magrittr_2.0.3 crayon_1.5.1 memoise_2.0.1 fansi_1.0.3 xml2_1.3.3
[41] tools_4.2.1 data.table_1.14.2 prettyunits_1.1.1 hms_1.1.1 formatR_1.12
[46] BiocIO_1.6.0 lifecycle_1.0.1 stringr_1.4.0 Rhdf5lib_1.18.2 munsell_0.5.0
[51] DelayedArray_0.22.0 lambda.r_1.2.4 AnnotationDbi_1.58.0 compiler_4.2.1 rlang_1.0.2
[56] rhdf5_2.40.0 futile.logger_1.4.3 grid_4.2.1 RCurl_1.98-1.6 rhdf5filters_1.8.0
[61] rstudioapi_0.13 rjson_0.2.21 rappdirs_0.3.3 bitops_1.0-7 gtable_0.3.0
[66] restfulr_0.0.13 DBI_1.1.2 curl_4.3.2 R6_2.5.1 gridExtra_2.3
[71] GenomicAlignments_1.32.0 dplyr_1.0.9 rtracklayer_1.56.0 fastmap_1.1.0 bit_4.0.4
[76] utf8_1.2.2 filelock_1.0.2 futile.options_1.0.1 stringi_1.7.6 parallel_4.2.1
[81] Rcpp_1.0.8.3 vctrs_0.4.1 png_0.1-7 dbplyr_2.1.1 tidyselect_1.1.2

Oct 18 '22 14:10 sorjuela

Hi Stephany,

that's right, looks like the -500 off-target padding default is used. Out of curiosity, any reason you want to change this? I remember picking that cutoff after looking at our 3000X data where the coverage left and right of a bait dropped to 0 long before reaching 500. So that's the gap between baits and off-target region that is ignored to avoid on-target reads making it into off-target regions. If there is a real need to change -500, I can add an option for that.
is a fair feature request and I'm surprised it took so long until it made it here :-) out of curiosity, how different is the automatic annotation compared to yours? So unfortunately for now, you would need to manually change the annotation.

I don't recommend using preprocessIntervals over IntervalFile.R. The latter makes a lot more sanity checks etc.

Markus

Oct 25 '22 00:10 lima1

Hi Markus, thanks for replying :)

I noticed this mainly because of the R package vignette, it says: "Some users got acceptable results with small (<1Mb) panels. Try to find a perfect off-target bin width (average.off.target.width in preprocessIntervals) and maximize the number of heterozygous SNPs by including as much padding as possible". So I thought I should try to change any type of padding, on and off target, although I think here you meant the padding in Mutect. Now that you explain what this padding actually is, maybe it doesn't make sense to change.
The automatic annotation is very close to mine actually, but I was thinking that I don't want re-annotate everytime I run IntervalFile.R, since I already know what they should be. And because the db packages you fetch to annotate might get updated or something, and then I might get a different annotation (not likely, but who knows).

Stephany

Oct 25 '22 08:10 sorjuela

Yes, I recommend using the baits (not the exons, the actual capture baits) BED file as you got it from the vendor in IntervalFile.R. Then run Mutect with interval padding. 50 is the default which is fine. But with high coverage, you should get a few more heterozygous SNPs by increasing it further. Much beyond your read length doesn't make a lot sense, since there won't be coverage anyways. If you increase it beyond 50, make sure to let PureCN.R know by increasing --padding.

IntervalFile.R should warn you if the off-target width isn't optimal.

Oct 25 '22 13:10 lima1

does this "off target" cover the entire genome outside the baits + padding? I have very high coverage, so I get plenty of off target mapping, I was thinking that if I call SNVs in the entire genome then I can also feed this into PureCN.R? but then what would I tell the padding? or does this not make sense.

Nov 09 '22 15:11 sorjuela

If you provide the proper baits file and use a 100-200bp padding in your variant calling, you shouldn't get much that's worth adding. It shouldn't hurt too much, but the VCFs become huge and most will be filtered out. I think you need to explicitly say that SNPs in off-target regions are kept. I don't do it in our assays.

Off-target is anything between baits +/- 500bp padding reaching the minimum size. It's used to calculate the tumor vs normal coverage to get a more accurate ploidy.

Nov 09 '22 20:11 lima1