Expanded allele not detected

Open stfacc opened this issue 1 year ago • 3 comments

Is there a reason why the longest allele is not detected for this locus:

reads.txt

Running with these options:

straglr.py \
        $bam \
        $reference \
        $prefix \
        --min_ins_size 20 --min_str_len 3 --max_str_len 6 \
        --nprocs 80 \
        --trf_args 2 5 5 80 10 10 6 \
        --debug

Version: 1.5.2

stfacc avatar Nov 25 '24 13:11 stfacc

The clustering step cannot group the 3 kb allele with the other sizes. This seems to be a rather complex locus, with both CAA and CGG repeats detected, and each one has some big and some small alleles. Have you confirmed this is correct? Also, the coordinates suggest this is not a human sample?
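
As a toy illustration of why a single very large read can end up unclustered, here is a minimal sketch. It is not Straglr's actual clustering code: the function name, the `max_gap` and `min_support` parameters, and the read sizes are all hypothetical. It only shows the general effect of requiring a minimum number of supporting reads per size cluster.

```python
def cluster_sizes(sizes, max_gap=50, min_support=2):
    """Toy 1-D clustering of repeat sizes (bp); NOT Straglr's algorithm.

    Sorted sizes are split wherever neighbours differ by more than max_gap.
    Clusters with fewer than min_support reads are reported as unassigned.
    """
    clusters, current = [], []
    for size in sorted(sizes):
        if current and size - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(size)
    if current:
        clusters.append(current)
    kept = [c for c in clusters if len(c) >= min_support]
    unassigned = [s for c in clusters if len(c) < min_support for s in c]
    return kept, unassigned


# Hypothetical read-level sizes: two small alleles plus one ~3 kb read.
sizes = [60, 63, 66, 150, 156, 159, 3000]
kept, unassigned = cluster_sizes(sizes)
print(kept)        # [[60, 63, 66], [150, 156, 159]]
print(unassigned)  # [3000]
```

With low depth, the expanded allele may be covered by only one read, so any support threshold of two or more leaves it out of the reported genotype even though the read itself is present.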

readmanchiu avatar Nov 27 '24 01:11 readmanchiu

Thanks for the quick answer.

This is a human sample; it's an expansion in NUTM2B-AS1.

A similar targeted analysis with TRGT shows the following structure (CAA probably a sequencing artifact):

S1 NUTM2B-AS1 wf

I appreciate that it's not easy to cluster. My only concern is that this would be completely missed in a genome-wide scan to detect novel expansions.

For a similar sample (with better depth), the same expansion is identified:

reads_s2.txt

S2 NUTM2B-AS1 wf
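
One possible guard against the concern above in a genome-wide scan is a post-hoc check that compares read-level sizes with the called allele sizes and flags reads far larger than any call. The sketch below is only an assumption about how such a check could look: the function name, the thresholds, and the input lists are hypothetical, and extracting the sizes from Straglr's output is left out.

```python
def flag_unreported_expansions(read_sizes, called_alleles, ratio=2.0, min_excess=100):
    """Return read sizes (bp) that are much larger than every called allele.

    A read is flagged when it exceeds the largest called allele both by the
    given ratio and by at least min_excess bp. Thresholds are illustrative.
    """
    largest = max(called_alleles, default=0)
    return sorted(
        size
        for size in read_sizes
        if size > ratio * largest and size - largest > min_excess
    )


# Hypothetical locus: calls of 63 bp and 155 bp, with one unclustered ~3 kb read.
reads = [60, 63, 66, 150, 156, 159, 3000]
calls = [63, 155]
print(flag_unreported_expansions(reads, calls))  # [3000]
```

Loci with any flagged reads could then be reviewed manually or re-run in a targeted mode, rather than relying on the genotype column alone.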

stfacc avatar Nov 27 '24 10:11 stfacc

So TRGT seems to support Straglr's TSV report. The CAA repeat cannot be regarded as a sequencing artifact. From the pictures it's clear there is an expansion going on at this locus, yet the expansion seems mosaic, so the genotype cannot be summarized concisely. I also need to think about how to represent it in VCF. Thanks very much for providing this example. Handling this will require some overhauls to the code, which will hopefully be incorporated into the next release.

readmanchiu avatar Nov 28 '24 15:11 readmanchiu

Hi,

I would like to follow up on this issue and add a few questions. I am currently analyzing a sample with mosaic repeat expansion.

I wonder if Straglr is capable of handling mosaic repeat loci. Specifically:

1. Would adjusting the parameter --max_num_clusters help in resolving mosaic alleles?
2. I noticed that Straglr includes some assessment of mosaicism. Could you please clarify how this is done?
3. Do you have any recommendations or best practices for analyzing mosaic repeat expansions using Straglr?

Any guidance or insights would be greatly appreciated. Thank you!

Best regards, Hsin

HLHsieh avatar Apr 11 '25 00:04 HLHsieh