SpecHLA icon indicating copy to clipboard operation
SpecHLA copied to clipboard

ExtractHLAread.sh

Open vegetableyu opened this issue 1 year ago • 5 comments

Hi, When I apply the ExtractHLAread.sh script, the unpaired reads produced are far more than the paired reads. After researching, I saw that some people say the bam file should be sorted by read name before converting to fastq, while in ExtractHLAread.sh, it is sorted by coordinates. However, when I switched to sorting by read name, although it reduced a lot of unpaired reads, running SpecHLA.sh afterwards generated a large number of “-” results. This is very confusing, do you know what might be the situation?

vegetableyu avatar Dec 09 '23 15:12 vegetableyu

Hi,

  • The bam file should be sorted by coordinates using samtools sort, because ExtractHLAread.sh extracts HLA-related reads by coordinates. In converting bam to fastq, samtools fastq can handle such bam file.

As for the large number of unpaired reads, please check following two points:

  • please ensure the reference version (hg19 or hg38) is the same between generating bam and running ExtractHLAread.sh.
  • For the unpaired reads, please check the alignment situation in the bam, see what alleles the unextracted read end mapped to.

please let me know if it works.

Best, shuai

wshuai294 avatar Dec 10 '23 07:12 wshuai294

Hi shuai, I looked at samtools fastq --help, which also says it needs to be sorted by reads name. However, using samtools fastq for bam files sorted by reads name and coordinates yielded little difference in results. With bam bam2Fastq, however, bam files sorted by read name get 20 times as many paired reads as those sorted by coordinates. samtools

vegetableyu avatar Dec 11 '23 06:12 vegetableyu

Hi, Apologies for any confusion caused. It seems that ExtractHLAread.sh utilizes bam bam2Fastq to obtain the fastq files. In order to investigate the discrepancy in read numbers resulting from bam bam2Fastq, I suggest examining the reads extracted from the "sorted by read name" but not from the "sorted by coordinates" approach. Please select a subset of these reads and verify the regions to which they are mapped by examining the BAM. If the reads belong to the HLA region, it is likely that the "sorted by read name" approach is correct; otherwise, the "sorted by coordinates" approach may be more appropriate.

By the way, could you please check the alignment detail of the reads ("unpaired reads produced are far more than the paired reads") in the original BAM.

wshuai294 avatar Dec 11 '23 14:12 wshuai294

Hi shuai, Thanks for your reply and suggestions, i will check this later. I noticed that others were also discussing the issue of preprocessing and gave some suggestions. This might help: https://github.com/Kingsford-Group/kourami/issues/30 https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse

vegetableyu avatar Dec 13 '23 06:12 vegetableyu

Thanks for the information. According to the issue, the reason of extracting too much unpaired reads might be keeping multiple alignments other than only the primary alignment.

wshuai294 avatar Dec 13 '23 09:12 wshuai294