samplot-ml icon indicating copy to clipboard operation
samplot-ml copied to clipboard

Snakemake job stops with error for some small variants <130bp

Open gdaviduu opened this issue 2 years ago • 8 comments

Hello,

The Snakemake pipeline works great most of the time but seems to throw an error only for very small (> 1bp) to up to ~70bp Deletions, such as this one:

and lr_steps[i+2].info['TYPE'] == 'Align' : IndexError: list index out of range ERROR with command: Command('bash scripts/gen_img.sh
--chrom scaffold00169 --start 939388 --end 939389 ...', , stdout[:20]: '/proj/snic2019-35-58', exit-code: 1, error: exit status 1, run-time: 1.344511671s)

However it also threw an error for larger DELs, such as this one:

ERROR with command: Command('bash scripts/gen_img.sh
--chrom chr4 --start 61314503 --end 61314570

For the moment my workaround has been to remove all variants <70bp from the VCF, but I did have to remove a few extra variants up to ~90bp or 121bp in length that threw the same error. I attach an example slurm.out file with an example error.

Thank you for your kind help, Gabriel

samplotml_error_out.txt

gdaviduu avatar Sep 16 '22 13:09 gdaviduu

There seems to be a problem with samplot itself when it is called with these regions. I think I'll have to do some debugging on the samplot side of things. Let's compile a list of all such failed genomic intervals (if possible), and if you can provide me with an alignment file for me to reproduce the error that would be great.

It seems like samplot is getting "out of bounds" errors with these regions. I'd be curious to see what the size of these chromosomes are. I wonder if the regions are very close to the boundaries. Since the samplot image will consist of the SV region and some window around the region, it might be the cause of this out of bounds issue. If that's the case then perhaps you could ignore regions that are close to the boundary of the contigs (they'd likely be false positives anyway).

mchowdh200 avatar Sep 16 '22 15:09 mchowdh200

Hi! Thank you for your helpful reply. If it's ok, I can email the alignment file, since this is still unpublished. Would this be the Smoove VCF file that is used as input for Samplot-ML?

And here is a list of (most of the) failed genomic intervals:

chr10 --start 19953778 --end 19953899 chr12 --start 18566326 --end 18566355 chr13 --start 8673492 --end 8673710 chr15 --start 1171112 --end 1171153 chr1A --start 689498 --end 689576 chr1 --start 1764 --end 6472 chr1 --start 60137917 --end 60138125 chr22 --start 1562894 --end 1563004 chr23 --start 1324682 --end 1324763 chr23 --start 1975177 --end 1975939 chr3 --start 107557393 --end 110072670 chr4 --start 57046058 --end 57046154 chr4 --start 61314503 --end 61314570 chr4 --start 69505276 --end 69505442 chr5 --start 50845824 --end 50845850 chr5 --start 8105604 --end 8105605 chr7 --start 34678962 --end 34679057 chrZ --start 68286601 --end 68286689 scaffold00169 --start 939388 --end 939389 scaffold00221 --start 252225 --end 252265 scaffold00404 --start 60959 --end 61003 scaffold00425 --start 42742 --end 42796 scaffold00741 --start 128 --end 175 scaffold01182 --start 9308 --end 9338

gdaviduu avatar Sep 29 '22 19:09 gdaviduu

To reproduce, I'd like the BAM (ie the sequence alignments), reference genome (and index) and the vcf along with your configuration file.

Email me at [email protected]

There's also a chance that we already have the sample in question on our lab's S3 storage, so maybe you don't have to send the actual BAM (sample name/filename might be sufficient).

mchowdh200 avatar Sep 29 '22 20:09 mchowdh200

Also could you post the lengths of the chromosomes for each of these failed cases? My suspicion is that the failed regions are occurring close to the boundaries of the chromosomes. Since Samplot is trying to plot the SV and its flanking regions it might be running into an index out of bounds error as a result.

mchowdh200 avatar Sep 29 '22 20:09 mchowdh200

Thanks! Attached is the list of chromosome sizes, to start with: chr_sizes_genome.txt

I will gather some of the bams we used and the reference genome and VCF to share with you, as soon as I can. Interestingly, for the 33 samples you already have access to, there was no error. These were a subset of the ~430 samples I am running now. Those specific BAMs still do not throw any error.

Thanks again, Gabriel

gdaviduu avatar Sep 30 '22 09:09 gdaviduu

P.S. Here are only the lengths for the chromosomes for which there was an error:

chr10 21077231 chr1 112674304 chr11 20431416 chr12 19787094 chr13 18021457 chr14 16467355 chr15 14042894 chr17 11241730 chr18 11529515 chr19 11116859 chr1A 69873230 chr22 3666047 chr23 7033470 chr3 110987800 chr4 70348008 chr5 61081188 chr7 36521674 chrZ 68733010 scaffold00169 1476803 scaffold00221 827490 scaffold00404 121432 scaffold00425 100469 scaffold00741 30314 scaffold01182 9971

gdaviduu avatar Sep 30 '22 09:09 gdaviduu

I've now emailed all of the requested files. Please let me know if you would like anything else. Thanks so much again!

gdaviduu avatar Oct 01 '22 16:10 gdaviduu

Thanks, I got the email. I'll try to get to the bottom of this issue within a week. I'll let you know if figure something out.

mchowdh200 avatar Oct 03 '22 00:10 mchowdh200