mgatk
mgatk copied to clipboard
sumstatsBP_overlap.py failed to handle paired-end reads with unmatched length in the overlapped region.
Describe the bug Hi again! Continue with #85 where I'm running a atac-seq dataset with unequal R1 and R2 read lengths as I trimmed the low-quality area of R1 after sequencing. It throws out a similar error message that the overlapped regions between R1 and R2 have different lengths after correcting the splitting issue.
After checking the result, the sequence length (len(fwd_seq)) is unequal to the distance between the ref start and ref end, probably due to the unmatched indels between R1 and R2 caused by sequencing error.
Additional context I modified the sumstatsBP_overlap.py by adding codes to compare the overlapped length between R1 and R2. I discard the entire read pairs if they don't match, as it's caused by sequencing error and only a few of the reads have this issue. (the file is the same as the one in #85 but try to upload it again: sumstatsBP_overlap.zip )
Same as before, I'm sure this is not the best way but rather a stringent filter to those reads, and super happy if it's possible to discuss more about it.
Best, Qirong
thanks for digging into this. maybe the best way is to take the min of the lengths of R1 and R2 in settings where they are not equal rather than fully discarding? you wouldn't get an index error that way I don't think and would keep more data.
Hi Caleb,
Thanks for the speedy reply! That sounds good to me and I will try to do it :). Will keep you posted.
Best, Qirong
Hi Caleb,
I modified the script further and rewrote the part to deal with the overlapped region between the forward and reverse strands. Instead of using the position of the base on the read, I used the ref_pos to get the bases from both fwd and rev. To do that:
- assume fwd and rev have the overlapped region.
- overlap_start and overlap_end are retrieved by comparing the reference_start and reference_end between fwd and rev.
- for each ref position, if either fwd or rev has a deletion on that one, take the alternative one.
- if both fwd and rev have the base on that ref position, select the one with higher base quality, or randomly select one.
- return fwd_overlap_use_idx and rev_overlap_use_idx.
The code looks like this:
base comparison between fwd and rev
def findHighQualityBases(fwd_read, rev_read):
# Step 1: Find the start and end ref positions of the overlap
overlap_start = max(fwd_read.reference_start, rev_read.reference_start)
overlap_end = min(fwd_read.reference_end, rev_read.reference_end)
# Ensure there is an overlap
if overlap_start >= overlap_end:
return [], []
# Get aligned pairs for forward and reverse reads
fwd_pairs = fwd_read.get_aligned_pairs(matches_only=True)
rev_pairs = rev_read.get_aligned_pairs(matches_only=True)
# Step 2, 3, 4: Find which read has a higher base quality at each overlapping ref position
fwd_overlap_use_idx = []
rev_overlap_use_idx = []
for ref_pos in range(overlap_start, overlap_end):
fwd_read_pos = next((rp for rp, rpos in fwd_pairs if rpos == ref_pos), None)
rev_read_pos = next((rp for rp, rpos in rev_pairs if rpos == ref_pos), None)
if fwd_read_pos is not None and rev_read_pos is not None:
fwd_base_qual = fwd_read.query_qualities[fwd_read_pos]
rev_base_qual = rev_read.query_qualities[rev_read_pos]
# randomly choose one if the base qualities are the same
# or choose the one with higher base quality
if (
fwd_base_qual == rev_base_qual
and random.choice([True, False])
or fwd_base_qual > rev_base_qual
):
fwd_overlap_use_idx.append(ref_pos)
else:
rev_overlap_use_idx.append(ref_pos)
elif fwd_read_pos is None and rev_read_pos is not None:
rev_overlap_use_idx.append(ref_pos)
elif fwd_read_pos is not None and rev_read_pos is None:
fwd_overlap_use_idx.append(ref_pos)
return fwd_overlap_use_idx, rev_overlap_use_idx
After that, instead of using qpos, it used ref_pos when constructing the fwd_use_idx and rev_use_idx. Also, it takes care of those reads that fwd is covered by rev, or vice versa. alignment quality is checked as previously implemented.
fwd and rev partitioning
# check alignment quality
if fwd_align_qual_read > alignment_quality and rev_align_qual_read > alignment_quality:
overlap_start = max(fwd_read.reference_start, rev_read.reference_start)
overlap_end = min(fwd_read.reference_end, rev_read.reference_end)
# if there is no overlap, use all of fwd and rev
if overlap_start >= overlap_end:
fwd_use_idx = np.arange(fwd_read.reference_start, fwd_read.reference_end)
rev_use_idx = np.arange(rev_read.reference_start, rev_read.reference_end)
# if there is an overlap, use the high quality bases in the overlap region
else:
fwd_overlap_use_idx, rev_overlap_use_idx = findHighQualityBases(fwd_read, rev_read)
# if reverse strand is included in the forward strand, partition the pair into fwd-only, overlap, and fwd-only
if fwd_read.reference_start < rev_read.reference_start and fwd_read.reference_end > rev_read.reference_end:
# merge the exclusive region and use idx in overlap region
fwd_use_idx = np.concatenate([np.arange(fwd_read.reference_start, overlap_start), fwd_overlap_use_idx, np.arange(overlap_end, fwd_read.reference_end)])
rev_use_idx = rev_overlap_use_idx
# if forward strand is included in the reverse strand, partition the pair into rev-only, overlap, and rev-only
elif fwd_read.reference_start > rev_read.reference_start and fwd_read.reference_end < rev_read.reference_end:
fwd_use_idx = fwd_overlap_use_idx
rev_use_idx = np.concatenate([np.arange(rev_read.reference_start, overlap_start), rev_overlap_use_idx, np.arange(overlap_end, rev_read.reference_end)])
else:
# partition the pair into fwd-only, overlap, and rev-only
# merge the exclusive region and use idx in overlap region
fwd_use_idx = np.concatenate([np.arange(fwd_read.reference_start, overlap_start), fwd_overlap_use_idx])
rev_use_idx = np.concatenate([rev_overlap_use_idx, np.arange(overlap_end, rev_read.reference_end)])
elif fwd_align_qual_read <= alignment_quality and rev_align_qual_read <= alignment_quality:
# use none for either
fwd_use_idx = np.array([])
rev_use_idx = np.array([])
elif fwd_align_qual_read > alignment_quality and rev_align_qual_read <= alignment_quality:
# use none of rev and all of fwd
fwd_use_idx = np.arange(fwd_read.reference_start, fwd_read.reference_end)
rev_use_idx = np.array([])
elif fwd_align_qual_read <= alignment_quality and rev_align_qual_read > alignment_quality:
# use all of rev and none of fwd
fwd_use_idx = np.array([])
rev_use_idx = np.arange(rev_read.reference_start, rev_read.reference_end)
Then, I used ref pos instead of qpos for counting and base quality calculation by simply replace pair[0] to pair[1] in your script:
counting
# handle fwd region, use refpos (pair[1]) instead of qpos (pair[0])
fwd_aligned_pairs = fwd_read.get_aligned_pairs(True)
fwd_region = [pair for pair in fwd_aligned_pairs if pair[1] in fwd_use_idx]
for qpos, refpos in fwd_region:
if refpos is not None and fwd_quality[qpos] > base_qual:
if fwd_seq[qpos] == "A":
qualA_fw[refpos] += fwd_quality[qpos]
countsA_fw[refpos] += 1
elif fwd_seq[qpos] == "C":
qualC_fw[refpos] += fwd_quality[qpos]
countsC_fw[refpos] += 1
elif fwd_seq[qpos] == "G":
qualG_fw[refpos] += fwd_quality[qpos]
countsG_fw[refpos] += 1
elif fwd_seq[qpos] == "T":
qualT_fw[refpos] += fwd_quality[qpos]
countsT_fw[refpos] += 1
# handle rev region, use refpos (pair[1]) instead of qpos (pair[0])
rev_aligned_pairs = rev_read.get_aligned_pairs(True)
rev_region = [pair for pair in rev_aligned_pairs if pair[1] in rev_use_idx]
for qpos, refpos in rev_region:
if refpos is not None and rev_quality[qpos] > base_qual:
if rev_seq[qpos] == "A":
qualA_rev[refpos] += rev_quality[qpos]
countsA_rev[refpos] += 1
elif rev_seq[qpos] == "C":
qualC_rev[refpos] += rev_quality[qpos]
countsC_rev[refpos] += 1
elif rev_seq[qpos] == "G":
qualG_rev[refpos] += rev_quality[qpos]
countsG_rev[refpos] += 1
elif rev_seq[qpos] == "T":
qualT_rev[refpos] += rev_quality[qpos]
countsT_rev[refpos] += 1
I uploaded the updated script here sumstatsBP_overlap.zip also. I tested on my end and it worked, but the coverage has somehow dropped a little bit, which is opposite to my expectation. The logic of the code makes sense to me but maybe I missed some obvious thing. What do you think about this modification?
Best, Qirong