GangSTR
GangSTR copied to clipboard
GangSTR fails on small input files
I am part of a team doing targeted sequencing of PolyG loci. We have previously been using LobSTR for genotyping polyG sequences from reads prior to a consensus making step (see https://github.com/risqueslab/PolyG-DS). I was investigating GangSTR on the recommendation of one of our collaborators to see if we might want to switch to using it. However, I keep running into the following message (for most of our loci):
[GangSTR-2.5.0] ProgressMeter: Processing chr1:46990539
[GangSTR-2.5.0] ProgressMeter: Setting flanking regions
[GangSTR-2.5.0] ProgressMeter: Loading read data
[GangSTR-2.5.0] ProgressMeter: Not enough reads extracted. Skipping locus..
and the .readinfo.tab
file is empty. I know that these input files generated good polyG data using lobSTR, so I was wondering if you had any idea what might be happening and how we might avoid it happening in the future. An example line from our regions file (hg19-based) is:
chr1 46990539 46990839 1 C
Hello,
It seems like GangSTR is rejecting all of the reads that are picked up from that region. Can you try changing the parameters for local realignment to see if they make a difference on the output? For example, --minmatch
is the minimum number of bases surrounding the repeat region that must match the flanking sequence around the repeat. By default it's set to 5, but sometimes around runs of homopolymers the sequencing error rate is very high and that may lead to rejecting the reads.
Please let me know if that didn't help.
Best,
Nima
Reducing --minmatch doesn't seem to have helped (in this case, at least); I'm still getting an empty file for the .readinfo.tab file. My command looks like this right now:
~/bioinformatics/programs/GangSTR-2.5.0-Source/build/GangSTR \
--bam ${inBaseName}.sort.bam \
--targeted \
--ref ${inRef} \
--regions ${inReg} \
--out ${inBaseName}.gangSTR8 \
--output-readinfo \
--verbose \
--insertmean 1 \
--insertsdev 1 \
--minmatch 1 \
--min-sample-reads 1 \
--max-proc-read 1000000 \
--output-readinfo \
--output-bootstraps \
--read-prob-mode \
--verbose --very
I also tried --minmatch 3
, but it didn't output anything either. Another (possibly related) issue is that the VCF file has "." for every row in the SAMPLE column, and the only things it seems to be picking up on are whole reads where the whole read is the same base (and which reads would normally be filtered out as being obviously bad).
This data originates from pools of CRISPR-based cut sites, if that helps any in figuring out what happened.
Humm I'm not sure what the issue may be. GangSTR has been mostly
used/tested on paired-end PCR free WGS datasets. I haven't worked with data
generated from CRISPR-based cut site, but I assume several of the
assumptions made by GangSTR may not hold in them. Having .
in the sample
column is expected since GangSTR isn't able to make any calls for any
samples.
Another potential problem is that GangSTR focuses on genotyping long TRs
using the context derived from paired-end reads. So for genotyping
variations that fit within a single read (and mutations around it), HipSTR
is probably a more suitable method.
Hope this helps.
Best,
Nima
On Wed, Mar 3, 2021 at 11:01 AM bkohrn [email protected] wrote:
Reducing --minmatch doesn't seem to have helped (in this case, at least); I'm still getting an empty file for the .readinfo.tab file. My command looks like this right now:
~/bioinformatics/programs/GangSTR-2.5.0-Source/build/GangSTR
--bam ${inBaseName}.sort.bam
--targeted
--ref ${inRef}
--regions ${inReg}
--out ${inBaseName}.gangSTR8
--output-readinfo
--verbose
--insertmean 1
--insertsdev 1
--minmatch 1
--min-sample-reads 1
--max-proc-read 1000000
--output-readinfo
--output-bootstraps
--read-prob-mode
--verbose --veryI also tried --minmatch 3, but it didn't output anything either. Another (possibly related) issue is that the VCF file has "." for every row in the SAMPLE column, and the only things it seems to be picking up on are whole reads where the whole read is the same base (and which reads would normally be filtered out as being obviously bad).
This data originates from pools of CRISPR-based cut sites, if that helps any in figuring out what happened.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/gymreklab/GangSTR/issues/103*issuecomment-789978999__;Iw!!Mih3wA!TvuoGaXe75rEIGF8ArKOt9ZQs8-I-Li6qrfWevyiWUWYnFuQuM6KkR0IRVN66mo$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ADSOQCDP7QU362DNAKWRUHLTB2BPTANCNFSM4YPT3BUQ__;!!Mih3wA!TvuoGaXe75rEIGF8ArKOt9ZQs8-I-Li6qrfWevyiWUWYnFuQuM6KkR0ItDZn534$ .