SigProfilerSimulator icon indicating copy to clipboard operation
SigProfilerSimulator copied to clipboard

Lower mutation number in simulation than observed

Open cutleraging opened this issue 1 year ago • 5 comments

Hello,

Thanks for a great tool! I have been using it to randomize the position of mutations in samples while controlling for mutational signatures. However, I have noticed that I am not getting the same amount of mutations in the simulated files as in the input, which I would expect.

I am running the tool like this, including some specific regions to simulate in (based on coverage): `#!/usr/bin/env python

import sys from SigProfilerSimulator import SigProfilerSimulator as sigSim

name = sys.argv[1] vcf_dir = sys.argv[2] bed = sys.argv[3].strip()

sigSim.SigProfilerSimulator(name, vcf_dir, "GRCh37", contexts=["96", "ID"], exome=None, simulations=10, updating=False, bed_file=bed, overlap=False, gender='female', chrom_based=False, seed_file=None, noisePoisson=False, cushion=100, region=None, vcf=True)`

Here is the output of the log file: `====================================== SigProfilerSimulator Checking for all reference files and relevant matrices... Creating a chromosome proportion file for the given BED file ranges...Completed! SigProfilerSimulator_cell_output/C1_05_24/output/ID/C1_05_24.ID83.region does not exist. Creating the matrix file now. Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 116.12 seconds. Starting matrix generation for INDELs...Completed! Elapsed time: 106.96 seconds. Matrices generated for 1 samples with 0 errors. Total of 597 SNVs, 1 DINUCs, and 15 INDELs were successfully analyzed. The context distribution file does not exist. This file needs to be created before simulating. This may take several hours... Chromosome X done Chromosome 1 done Chromosome 2 done Chromosome 3 done Chromosome 4 done Chromosome 5 done Chromosome 6 done Chromosome 7 done Chromosome 8 done Chromosome 9 done Chromosome 10 done Chromosome 11 done Chromosome 12 done Chromosome 13 done Chromosome 14 done Chromosome 15 done Chromosome 16 done Chromosome 17 done Chromosome 18 done Chromosome 19 done Chromosome 20 done Chromosome 21 done The context distribution file has been created! The context distribution file does not exist. This file needs to be created before simulating. This may take several hours... Chromosome X done Chromosome 1 done Chromosome 2 done Chromosome 3 done Chromosome 4 done Chromosome 5 done Chromosome 6 done Chromosome 7 done Chromosome 8 done Chromosome 9 done Chromosome 10 done Chromosome 11 done Chromosome 12 done Chromosome 13 done Chromosome 14 done Chromosome 15 done Chromosome 16 done Chromosome 17 done Chromosome 18 done Chromosome 19 done Chromosome 20 done Chromosome 21 done The context distribution file has been created!

Files successfully read and mutations collected. Mutation assignment starting now. Mutations have been distributed. Starting simulation now... Chromosome 21 done Chromosome 22 done Chromosome 19 done Chromosome 20 done Chromosome 18 done Chromosome 13 done Chromosome 17 done Chromosome 15 done Chromosome 16 done Chromosome 14 done Chromosome 9 done Chromosome 11 done Chromosome 10 done Chromosome 12 done Chromosome X done Chromosome 8 done Chromosome 7 done Chromosome 6 done Chromosome 4 done Chromosome 5 done Chromosome 3 done Chromosome 1 done Chromosome 2 done Simulation completed Job took 2381.3919444084167 seconds`

In my input file I have 613 mutations, but in the simulated output I only get 313 mutations. This issue is not specific to this file. Any help with this? I have tried to circumvent this by creating many more simulations than I need and sampling them to get the matching number of mutations as in my real sample.

Thanks, Ronnie

cutleraging avatar Dec 19 '23 17:12 cutleraging

Hi @cutleraging,

Thanks for reaching out.

I think you are using a customized bed file. When you are specifying your own bed file, the tool simulates on custom regions of the genome. All mutations of your input vcf file might not fall into custom regions. That's why it might showing less mutations.

Can I ask you to check the position of the input vcf file with bed file to see if all the mutations overlapped. Please let me know if you encounter any other issues.

Best, Mousumy

MousumyCSE avatar Dec 22 '23 03:12 MousumyCSE

Hi Mousumy,

In R, I checked if my mutations overlapped the regions in the bed file as follows

> vcf.obs <- read_vcfs_as_granges("MMEA9_E3.vcf",
                                "MMEA9_E3.vcf",
                                genome = BSgenome.Hsapiens.UCSC.hg19,
                                group = "all",
                                type = "all")
> vcf.obs <- vcf.obs$MMEA9_E3.vcf

> bed <- import.bed("MMEA9_E3.bed")
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.obs))) == length(vcf.obs)
[1] FALSE

I noticed this was because the BED file uses 0-indexing, while when it is imported into a Granges object, it becomes 1-indexed. So essentially all start positions are increased by +1.

So when I corrected for this, I now see that indeed all positions are covered

> vcf.obs <- read_vcfs_as_granges("MMEA9_E3.vcf",
                                "MMEA9_E3.vcf",
                                genome = BSgenome.Hsapiens.UCSC.hg19,
                                group = "all",
                                type = "all")
> vcf.obs <- vcf.obs$MMEA9_E3.vcf

> bed <- import.bed("MMEA9_E3.bed")
> start(bed) <- start(bed) - 1
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.obs))) == length(vcf.obs)
[1] TRUE

Now I am checking one of the simulated files produced from the same sample and notice that there are over 3000 simulated mutations which do not lie in the covered regions of the bed file

> vcf.sim <- read_vcfs_as_granges("MMEA9_E3_999.vcf",
                                "MMEA9_E3_999.vcf",
                                genome = BSgenome.Hsapiens.UCSC.hg19,
                                group = "all",
                                type = "all")
> vcf.sim <- vcf.sim$MMEA9_E3.vcf

> bed <- import.bed("MMEA9_E3.bed")
> start(bed) <- start(bed) - 1
> seqlevelsStyle(bed) <- "UCSC"
> length(unique(queryHits(overlaps.sim))) == length(vcf.sim)
[1] FALSE

So not only does it seem like there are less mutations in the simulated file compared to the observed, but the simulated mutations also are placed outside of the regions given in the bed file.

Would you like a link to the output?

Best, Ronnie

cutleraging avatar Jan 03 '24 00:01 cutleraging

Hi @cutleraging,

Thanks for your explanation. Yes, it will be great if you can share the output files as well as one of your input VCF file and bed file so that I can run at my end to reproduce the same issue.

Thanks! Mousumy

MousumyCSE avatar Jan 05 '24 17:01 MousumyCSE

Hi @MousumyCSE ,

Files are here: https://www.dropbox.com/scl/fo/tva1jkwd5iainmz7c2l7n/h?rlkey=luw0vxidbvghglu0oojgucdm5&dl=0

Thanks for taking a look! Ronnie

cutleraging avatar Jan 05 '24 22:01 cutleraging

Hi @cutleraging,

Thank you for sharing the input files and apologies for the late response.

We have worked on it and updated the tools. Please reinstall the updated tools and run your input files. Just to let you know, if you want that your simulated mutations should fall within the regions of the given bed file then please run the SigProfilerSimulator tool with cushion=0.

Please let me know, if you have any further issues.

Best, Mousumy

MousumyCSE avatar Sep 24 '24 18:09 MousumyCSE

Hi @cutleraging,

Please feel free to reopen this issue if you encounter any problems.

Best, Mousumy

MousumyCSE avatar Oct 01 '24 15:10 MousumyCSE