Not all loci in output bedfile

Open ljohansson opened this issue 11 months ago • 2 comments

Dear @readmanchiu,

I am using straglr via the vip pipeline (https://github.com/molgenis/vip/), as described in an earlier thread. As input we provide a bed file with loci, and I expected all of these loci to appear in the output vcf. However, loci are often missing from the output, and which ones are present or absent differs per sample.

We have not yet found the cause of the missing loci. Could it be that straglr filters loci based on quality? If so, is there an option to force all loci into the vcf, using the QUAL and FILTER columns to flag low quality while keeping the locus in the output vcf file?

Because we are using the @philres fork, it could be an issue related to that fork, but I believe this question is not related to the altered code. If you have any insights they would be very welcome.

ljohansson avatar Mar 15 '24 07:03 ljohansson

I haven't tried the vip pipeline. From what you wrote, you can specify the source of Straglr, and you are using the @philres fork. Straglr's only filtering is based on the number of supporting reads, and the number of events (number of loci, not number of lines) should be the same between the tsv and bed files. I don't know whether the @philres fork does any filtering when it converts Straglr's output to VCF. In any case, I've been asked to produce a VCF output. Right now I'm still at the investigation phase, but it's targeted for the next release.

readmanchiu avatar Mar 16 '24 01:03 readmanchiu

VCF output has been added in v1.5.0. Some loci may be missing, possibly because the provided target motif does not match the detected motif. Feel free to send me data for investigation if possible.

readmanchiu avatar Apr 17 '24 17:04 readmanchiu

Dear @readmanchiu, Apologies for not responding sooner; I had missed your replies. Thank you for adding vcf output to straglr. In the meantime MOLGENIS VIP has created their own Straglr fork (https://github.com/molgenis/straglr). We have learnt that in the philres fork, variants are filtered out when the number of repeat units (RU) matches the reference genome. In that case the repeat is not considered a variant.

ljohansson avatar Jul 05 '24 11:07 ljohansson

Running Straglr in genome scan mode will only report loci that are larger than the reference, whereas running it in genotype mode (with loci of interest provided) will return genotypes for all loci, regardless of whether they match the reference. I suggest checking the new vcf output to see whether any loci are still missing.
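One way to run that check is to compare the input bed loci against the records in the output vcf. The following is only a rough sketch, not part of Straglr: the function names are hypothetical, and it assumes that each vcf record carries its locus end in an `END=` INFO entry (falling back to POS if absent) and that a locus counts as reported when a record on the same chromosome overlaps it. Adjust the matching rule if the actual vcf layout differs.

```python
def parse_bed(lines):
    """Return (chrom, start, end) tuples from BED lines (0-based, half-open)."""
    loci = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        loci.append((chrom, int(start), int(end)))
    return loci


def parse_vcf(lines):
    """Return (chrom, start, end) tuples from VCF lines, converted to 0-based half-open.

    Assumption: the locus end is given as END=<int> in INFO; otherwise POS is used.
    """
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[0]
        pos = int(fields[1])
        info = fields[7] if len(fields) > 7 else ""
        end = pos
        for entry in info.split(";"):
            if entry.startswith("END="):
                end = int(entry[4:])
        records.append((chrom, pos - 1, end))
    return records


def missing_loci(bed_loci, vcf_records):
    """Return BED loci that no VCF record on the same chromosome overlaps."""
    missing = []
    for chrom, start, end in bed_loci:
        covered = any(
            c == chrom and s < end and start < e for c, s, e in vcf_records
        )
        if not covered:
            missing.append((chrom, start, end))
    return missing
```

With real files this could be called as `missing_loci(parse_bed(open("loci.bed")), parse_vcf(open("sample.vcf")))`, where both file names are placeholders; a compressed vcf would need `gzip.open(..., "rt")` instead of `open`.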

readmanchiu avatar Jul 05 '24 23:07 readmanchiu

Thank you. For now I will close this issue. I did have a different question on an issue that was already closed (https://github.com/bcgsc/straglr/issues/19). I could not reopen it, but to avoid creating a duplicate issue I posted the question there.

ljohansson avatar Jul 09 '24 10:07 ljohansson

#Closed for now

ljohansson avatar Jul 09 '24 10:07 ljohansson