SigProfilerSimulator
SigProfilerSimulator copied to clipboard
Lower mutation number in simulation than observed
Hello,
Thanks for a great tool! I have been using it to randomize the position of mutations in samples while controlling for mutational signatures. However, I have noticed that I am not getting the same amount of mutations in the simulated files as in the input, which I would expect.
I am running the tool like this, including some specific regions to simulate in (based on coverage): `#!/usr/bin/env python
import sys from SigProfilerSimulator import SigProfilerSimulator as sigSim
name = sys.argv[1] vcf_dir = sys.argv[2] bed = sys.argv[3].strip()
sigSim.SigProfilerSimulator(name, vcf_dir, "GRCh37", contexts=["96", "ID"], exome=None, simulations=10, updating=False, bed_file=bed, overlap=False, gender='female', chrom_based=False, seed_file=None, noisePoisson=False, cushion=100, region=None, vcf=True)`
Here is the output of the log file: `====================================== SigProfilerSimulator Checking for all reference files and relevant matrices... Creating a chromosome proportion file for the given BED file ranges...Completed! SigProfilerSimulator_cell_output/C1_05_24/output/ID/C1_05_24.ID83.region does not exist. Creating the matrix file now. Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 116.12 seconds. Starting matrix generation for INDELs...Completed! Elapsed time: 106.96 seconds. Matrices generated for 1 samples with 0 errors. Total of 597 SNVs, 1 DINUCs, and 15 INDELs were successfully analyzed. The context distribution file does not exist. This file needs to be created before simulating. This may take several hours... Chromosome X done Chromosome 1 done Chromosome 2 done Chromosome 3 done Chromosome 4 done Chromosome 5 done Chromosome 6 done Chromosome 7 done Chromosome 8 done Chromosome 9 done Chromosome 10 done Chromosome 11 done Chromosome 12 done Chromosome 13 done Chromosome 14 done Chromosome 15 done Chromosome 16 done Chromosome 17 done Chromosome 18 done Chromosome 19 done Chromosome 20 done Chromosome 21 done The context distribution file has been created! The context distribution file does not exist. This file needs to be created before simulating. This may take several hours... Chromosome X done Chromosome 1 done Chromosome 2 done Chromosome 3 done Chromosome 4 done Chromosome 5 done Chromosome 6 done Chromosome 7 done Chromosome 8 done Chromosome 9 done Chromosome 10 done Chromosome 11 done Chromosome 12 done Chromosome 13 done Chromosome 14 done Chromosome 15 done Chromosome 16 done Chromosome 17 done Chromosome 18 done Chromosome 19 done Chromosome 20 done Chromosome 21 done The context distribution file has been created!
Files successfully read and mutations collected. Mutation assignment starting now. Mutations have been distributed. Starting simulation now... Chromosome 21 done Chromosome 22 done Chromosome 19 done Chromosome 20 done Chromosome 18 done Chromosome 13 done Chromosome 17 done Chromosome 15 done Chromosome 16 done Chromosome 14 done Chromosome 9 done Chromosome 11 done Chromosome 10 done Chromosome 12 done Chromosome X done Chromosome 8 done Chromosome 7 done Chromosome 6 done Chromosome 4 done Chromosome 5 done Chromosome 3 done Chromosome 1 done Chromosome 2 done Simulation completed Job took 2381.3919444084167 seconds`
In my input file I have 613 mutations, but in the simulated output I only get 313 mutations. This issue is not specific to this file. Any help with this? I have tried to circumvent this by creating many more simulations than I need and sampling them to get the matching number of mutations as in my real sample.
Thanks, Ronnie
Hi @cutleraging,
Thanks for reaching out.
I think you are using a customized bed file. When you are specifying your own bed file, the tool simulates on custom regions of the genome. All mutations of your input vcf file might not fall into custom regions. That's why it might showing less mutations.
Can I ask you to check the position of the input vcf file with bed file to see if all the mutations overlapped. Please let me know if you encounter any other issues.
Best, Mousumy
Hi Mousumy,
In R, I checked if my mutations overlapped the regions in the bed file as follows
> vcf.obs <- read_vcfs_as_granges("MMEA9_E3.vcf",
"MMEA9_E3.vcf",
genome = BSgenome.Hsapiens.UCSC.hg19,
group = "all",
type = "all")
> vcf.obs <- vcf.obs$MMEA9_E3.vcf
> bed <- import.bed("MMEA9_E3.bed")
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.obs))) == length(vcf.obs)
[1] FALSE
I noticed this was because the BED file uses 0-indexing, while when it is imported into a Granges object, it becomes 1-indexed. So essentially all start positions are increased by +1.
So when I corrected for this, I now see that indeed all positions are covered
> vcf.obs <- read_vcfs_as_granges("MMEA9_E3.vcf",
"MMEA9_E3.vcf",
genome = BSgenome.Hsapiens.UCSC.hg19,
group = "all",
type = "all")
> vcf.obs <- vcf.obs$MMEA9_E3.vcf
> bed <- import.bed("MMEA9_E3.bed")
> start(bed) <- start(bed) - 1
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.obs))) == length(vcf.obs)
[1] TRUE
Now I am checking one of the simulated files produced from the same sample and notice that there are over 3000 simulated mutations which do not lie in the covered regions of the bed file
> vcf.sim <- read_vcfs_as_granges("MMEA9_E3_999.vcf",
"MMEA9_E3_999.vcf",
genome = BSgenome.Hsapiens.UCSC.hg19,
group = "all",
type = "all")
> vcf.sim <- vcf.sim$MMEA9_E3.vcf
> bed <- import.bed("MMEA9_E3.bed")
> start(bed) <- start(bed) - 1
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.sim))) == length(vcf.sim)
[1] FALSE
So not only does it seem like there are less mutations in the simulated file compared to the observed, but the simulated mutations also are placed outside of the regions given in the bed file.
Would you like a link to the output?
Best, Ronnie
Hi @cutleraging,
Thanks for your explanation. Yes, it will be great if you can share the output files as well as one of your input VCF file and bed file so that I can run at my end to reproduce the same issue.
Thanks! Mousumy
Hi @MousumyCSE ,
Files are here: https://www.dropbox.com/scl/fo/tva1jkwd5iainmz7c2l7n/h?rlkey=luw0vxidbvghglu0oojgucdm5&dl=0
Thanks for taking a look! Ronnie
Hi @cutleraging,
Thank you for sharing the input files and apologies for the late response.
We have worked on it and updated the tools. Please reinstall the updated tools and run your input files. Just to let you know, if you want that your simulated mutations should fall within the regions of the given bed file then please run the SigProfilerSimulator tool with cushion=0.
Please let me know, if you have any further issues.
Best, Mousumy
Hi @cutleraging,
Please feel free to reopen this issue if you encounter any problems.
Best, Mousumy