vg icon indicating copy to clipboard operation
vg copied to clipboard

Mapping very short reads with vg giraffe

Open gamerino opened this issue 1 year ago • 5 comments

Hi vg team!

I'm trying to use vg giraffe for mapping very short paired-end reads (36bp) against a pangenome.

I tried running this command line

> vg giraffe -M 10 -x ${panIdxPath}/${panPre}.xg -g ${panIdxPath}/${panPre}.gg \
-H ${panIdxPath}/${panPre}.gbwt -L 2000 -m ${panIdxPath}/${panPre}.min  \
-d ${panIdxPath}/${panPre}.dist -p -f ${fastq1} -f ${fastq2} -t 30 1> ${outPre}.gam \
--report-name ${log} 2> ${log}_run.log

The program didn't produce an error but it returned single-end alignments since it failed to pair reads.

warning[vg::giraffe]: Encountered 100000 ambiguously-paired reads before finding enough
                      unambiguously-paired reads to learn fragment length distribution. Are you sure
                      your reads are paired and your graph is not a hairball?
warning[vg::giraffe]: Finalizing fragment length distribution before reaching maximum sample size
                      mapped 0 reads single ended with 100000 pairs of reads left unmapped
                      mean: 0, stdev: 1
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
                      Fragment length distribution: mean=0, stdev=1
                      Fragment distance limit: 2, read distance limit: 200
warning[vg::giraffe]: Falling back on single-end mapping
Using fragment length estimate: 0 +/- 1

I tried re-running vg giraffe setting different values for some parameters I understood are related to read length:

  • --distance-limit 36
  • --distance-limit 86
  • --paired-distance-limit 3
  • --rescue-algorithm gssw

but in all cases, I obtained the same results.

I also tried to manually set --fragment-mean 86 and --fragment-stdev 80. In this case, warning messages didn't appear in the log file, instead it reports the defined values were used.

Using fragment length estimate: 86 +/- 80

However, when I checked the stats (samtools flagstats) for the bam file obtained with vg surject, the percentage of properly paired reads is 0%

> vg surject -b -i -t 30 -x ${panIdxPath}/${panPre}.xg ${prefix}.gam 1> ${prefix}.bam
> samtools flagstat -@30 ${prefix}.bam

326331274 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
326331274 + 0 paired in sequencing
163165637 + 0 read1
163165637 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I read the documentation and wiki page but I didn't find a way to fix it. I also tried to look for a parameter for reducing the seed value (default 29) but I didn't find it.

Is there any limitation in the vg giraffe implementation for the minimum read length for paired-end experiments? I ran it using 76PE and 101PE and it worked fine. If it is not, can you please recommend me alternatives for being able to get alignments for properly paired reads for this experiment?

Thank you for your help!

Gabriela

gamerino avatar Jun 20 '23 08:06 gamerino