mgatk icon indicating copy to clipboard operation
mgatk copied to clipboard

sumstatsBP_overlap.py failed to handle paired-end reads with unmatched length in the overlapped region.

Open Qirong-Lin opened this issue 5 months ago • 3 comments

Describe the bug Hi again! Continue with #85 where I'm running a atac-seq dataset with unequal R1 and R2 read lengths as I trimmed the low-quality area of R1 after sequencing. It throws out a similar error message that the overlapped regions between R1 and R2 have different lengths after correcting the splitting issue.

After checking the result, the sequence length (len(fwd_seq)) is unequal to the distance between the ref start and ref end, probably due to the unmatched indels between R1 and R2 caused by sequencing error.

Additional context I modified the sumstatsBP_overlap.py by adding codes to compare the overlapped length between R1 and R2. I discard the entire read pairs if they don't match, as it's caused by sequencing error and only a few of the reads have this issue. (the file is the same as the one in #85 but try to upload it again: sumstatsBP_overlap.zip )

Same as before, I'm sure this is not the best way but rather a stringent filter to those reads, and super happy if it's possible to discuss more about it.

Best, Qirong

Qirong-Lin avatar Jan 16 '24 14:01 Qirong-Lin

thanks for digging into this. maybe the best way is to take the min of the lengths of R1 and R2 in settings where they are not equal rather than fully discarding? you wouldn't get an index error that way I don't think and would keep more data.

caleblareau avatar Jan 16 '24 14:01 caleblareau

Hi Caleb,

Thanks for the speedy reply! That sounds good to me and I will try to do it :). Will keep you posted.

Best, Qirong

Qirong-Lin avatar Jan 16 '24 16:01 Qirong-Lin

Hi Caleb,

I modified the script further and rewrote the part to deal with the overlapped region between the forward and reverse strands. Instead of using the position of the base on the read, I used the ref_pos to get the bases from both fwd and rev. To do that:

  1. assume fwd and rev have the overlapped region.
  2. overlap_start and overlap_end are retrieved by comparing the reference_start and reference_end between fwd and rev.
  3. for each ref position, if either fwd or rev has a deletion on that one, take the alternative one.
  4. if both fwd and rev have the base on that ref position, select the one with higher base quality, or randomly select one.
  5. return fwd_overlap_use_idx and rev_overlap_use_idx.

The code looks like this:

base comparison between fwd and rev
def findHighQualityBases(fwd_read, rev_read):
	# Step 1: Find the start and end ref positions of the overlap
	overlap_start = max(fwd_read.reference_start, rev_read.reference_start)
	overlap_end = min(fwd_read.reference_end, rev_read.reference_end)

	# Ensure there is an overlap
	if overlap_start >= overlap_end:
		return [], []

	# Get aligned pairs for forward and reverse reads
	fwd_pairs = fwd_read.get_aligned_pairs(matches_only=True)
	rev_pairs = rev_read.get_aligned_pairs(matches_only=True)

	# Step 2, 3, 4: Find which read has a higher base quality at each overlapping ref position
	fwd_overlap_use_idx = []
	rev_overlap_use_idx = []

	for ref_pos in range(overlap_start, overlap_end):
		fwd_read_pos = next((rp for rp, rpos in fwd_pairs if rpos == ref_pos), None)
		rev_read_pos = next((rp for rp, rpos in rev_pairs if rpos == ref_pos), None)

		if fwd_read_pos is not None and rev_read_pos is not None:
			fwd_base_qual = fwd_read.query_qualities[fwd_read_pos]
			rev_base_qual = rev_read.query_qualities[rev_read_pos]

			# randomly choose one if the base qualities are the same
			# or choose the one with higher base quality
			if (
				fwd_base_qual == rev_base_qual
				and random.choice([True, False])
				or fwd_base_qual > rev_base_qual
			):
				fwd_overlap_use_idx.append(ref_pos)
			else:
				rev_overlap_use_idx.append(ref_pos)
		elif fwd_read_pos is None and rev_read_pos is not None:
			rev_overlap_use_idx.append(ref_pos)
		elif fwd_read_pos is not None and rev_read_pos is None:
			fwd_overlap_use_idx.append(ref_pos)

		return fwd_overlap_use_idx, rev_overlap_use_idx

After that, instead of using qpos, it used ref_pos when constructing the fwd_use_idx and rev_use_idx. Also, it takes care of those reads that fwd is covered by rev, or vice versa. alignment quality is checked as previously implemented.

fwd and rev partitioning
	# check alignment quality
	if fwd_align_qual_read > alignment_quality and rev_align_qual_read > alignment_quality:
		overlap_start = max(fwd_read.reference_start, rev_read.reference_start)
		overlap_end = min(fwd_read.reference_end, rev_read.reference_end)	

		# if there is no overlap, use all of fwd and rev
		if overlap_start >= overlap_end:
			fwd_use_idx = np.arange(fwd_read.reference_start, fwd_read.reference_end)
			rev_use_idx = np.arange(rev_read.reference_start, rev_read.reference_end)
		# if there is an overlap, use the high quality bases in the overlap region
		else:
			fwd_overlap_use_idx, rev_overlap_use_idx = findHighQualityBases(fwd_read, rev_read)
			
			# if reverse strand is included in the forward strand, partition the pair into fwd-only, overlap, and fwd-only
			if fwd_read.reference_start < rev_read.reference_start and fwd_read.reference_end > rev_read.reference_end:
				# merge the exclusive region and use idx in overlap region
				fwd_use_idx = np.concatenate([np.arange(fwd_read.reference_start, overlap_start), fwd_overlap_use_idx, np.arange(overlap_end, fwd_read.reference_end)])
				rev_use_idx = rev_overlap_use_idx

			# if forward strand is included in the reverse strand, partition the pair into rev-only, overlap, and rev-only
			elif fwd_read.reference_start > rev_read.reference_start and fwd_read.reference_end < rev_read.reference_end:
				fwd_use_idx = fwd_overlap_use_idx
				rev_use_idx = np.concatenate([np.arange(rev_read.reference_start, overlap_start), rev_overlap_use_idx, np.arange(overlap_end, rev_read.reference_end)])

			else:
			# partition the pair into fwd-only, overlap, and rev-only
			# merge the exclusive region and use idx in overlap region
				fwd_use_idx = np.concatenate([np.arange(fwd_read.reference_start, overlap_start), fwd_overlap_use_idx])
				rev_use_idx = np.concatenate([rev_overlap_use_idx, np.arange(overlap_end, rev_read.reference_end)])

	elif fwd_align_qual_read <= alignment_quality and rev_align_qual_read <= alignment_quality:
		# use none for either
		fwd_use_idx = np.array([])
		rev_use_idx = np.array([])
	
	elif fwd_align_qual_read > alignment_quality and rev_align_qual_read <= alignment_quality:
		# use none of rev and all of fwd
		fwd_use_idx = np.arange(fwd_read.reference_start, fwd_read.reference_end)
		rev_use_idx = np.array([])
	
	elif fwd_align_qual_read <= alignment_quality and rev_align_qual_read > alignment_quality:
		# use all of rev and none of fwd
		fwd_use_idx = np.array([])
		rev_use_idx = np.arange(rev_read.reference_start, rev_read.reference_end)

Then, I used ref pos instead of qpos for counting and base quality calculation by simply replace pair[0] to pair[1] in your script:

counting
# handle fwd region, use refpos (pair[1]) instead of qpos (pair[0])
	fwd_aligned_pairs = fwd_read.get_aligned_pairs(True)
	fwd_region = [pair for pair in fwd_aligned_pairs if pair[1] in fwd_use_idx]
	for qpos, refpos in fwd_region:
		if refpos is not None and fwd_quality[qpos] > base_qual:
			if fwd_seq[qpos] == "A":
				qualA_fw[refpos] += fwd_quality[qpos]
				countsA_fw[refpos] += 1
			elif fwd_seq[qpos] == "C":
				qualC_fw[refpos] += fwd_quality[qpos]
				countsC_fw[refpos] += 1
			elif fwd_seq[qpos] == "G":
				qualG_fw[refpos] += fwd_quality[qpos]
				countsG_fw[refpos] += 1
			elif fwd_seq[qpos] == "T":
				qualT_fw[refpos] += fwd_quality[qpos]
				countsT_fw[refpos] += 1
	
	# handle rev region, use refpos (pair[1]) instead of qpos (pair[0])
	rev_aligned_pairs = rev_read.get_aligned_pairs(True)
	rev_region = [pair for pair in rev_aligned_pairs if pair[1] in rev_use_idx]
	for qpos, refpos in rev_region:
		if refpos is not None and rev_quality[qpos] > base_qual:
			if rev_seq[qpos] == "A":
				qualA_rev[refpos] += rev_quality[qpos]
				countsA_rev[refpos] += 1
			elif rev_seq[qpos] == "C":
				qualC_rev[refpos] += rev_quality[qpos]
				countsC_rev[refpos] += 1
			elif rev_seq[qpos] == "G":
				qualG_rev[refpos] += rev_quality[qpos]
				countsG_rev[refpos] += 1
			elif rev_seq[qpos] == "T":
				qualT_rev[refpos] += rev_quality[qpos]
				countsT_rev[refpos] += 1

I uploaded the updated script here sumstatsBP_overlap.zip also. I tested on my end and it worked, but the coverage has somehow dropped a little bit, which is opposite to my expectation. The logic of the code makes sense to me but maybe I missed some obvious thing. What do you think about this modification?

Best, Qirong

Qirong-Lin avatar Jan 17 '24 18:01 Qirong-Lin