SpecHLA
SpecHLA copied to clipboard
ExtractHLAread.sh
Hi, When I apply the ExtractHLAread.sh script, the unpaired reads produced are far more than the paired reads. After researching, I saw that some people say the bam file should be sorted by read name before converting to fastq, while in ExtractHLAread.sh, it is sorted by coordinates. However, when I switched to sorting by read name, although it reduced a lot of unpaired reads, running SpecHLA.sh afterwards generated a large number of “-” results. This is very confusing, do you know what might be the situation?
Hi,
- The bam file should be sorted by coordinates using
samtools sort
, becauseExtractHLAread.sh
extracts HLA-related reads by coordinates. In convertingbam
tofastq
,samtools fastq
can handle suchbam
file.
As for the large number of unpaired reads, please check following two points:
- please ensure the reference version (hg19 or hg38) is the same between generating bam and running
ExtractHLAread.sh
. - For the unpaired reads, please check the alignment situation in the
bam
, see what alleles the unextracted read end mapped to.
please let me know if it works.
Best, shuai
Hi shuai,
I looked at samtools fastq --help, which also says it needs to be sorted by reads name. However, using samtools fastq for bam files sorted by reads name and coordinates yielded little difference in results. With bam bam2Fastq, however, bam files sorted by read name get 20 times as many paired reads as those sorted by coordinates.
Hi,
Apologies for any confusion caused. It seems that ExtractHLAread.sh
utilizes bam bam2Fastq
to obtain the fastq
files. In order to investigate the discrepancy in read numbers resulting from bam bam2Fastq
, I suggest examining the reads extracted from the "sorted by read name" but not from the "sorted by coordinates" approach. Please select a subset of these reads and verify the regions to which they are mapped by examining the BAM
. If the reads belong to the HLA region, it is likely that the "sorted by read name" approach is correct; otherwise, the "sorted by coordinates" approach may be more appropriate.
By the way, could you please check the alignment detail of the reads ("unpaired reads produced are far more than the paired reads") in the original BAM
.
Hi shuai, Thanks for your reply and suggestions, i will check this later. I noticed that others were also discussing the issue of preprocessing and gave some suggestions. This might help: https://github.com/Kingsford-Group/kourami/issues/30 https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse
Thanks for the information. According to the issue, the reason of extracting too much unpaired reads might be keeping multiple alignments other than only the primary alignment.