NGSpeciesID icon indicating copy to clipboard operation
NGSpeciesID copied to clipboard

Suggestion for [--rc_identity_threshold]

Open edgardomortiz opened this issue 4 years ago • 3 comments

Hello Kristoffer,

I ran NGSpeciesID with the following command:

for file in *.fastq; do
	NGSpeciesID --t 4 --ont --fastq $file \
	--consensus --medaka \
	--abundance_ratio 0.005 \
	--rc_identity_threshold 0.85 \
	--outfolder ../04_ngspeciesid/${file//.fastq/_clusters_medaka} \
	"&>"../04_ngspeciesid/${file//.fastq/.ngspeciesid.medaka.log.txt}
done

And everything worked nicely but in some cases it creates two separate consensus sequences in opposite directions that should have been joined by --rc_identity_threshold 0.85. For example in the picture, the overlapping region of both consensuses has a similarity of 98.7%

2020-06-16 01 49 55 pm

Would it be too hard to implement reverse complement during the initial clustering steps to avoid this kind of result? This would be very useful for the case of fragmented amplicons (and it would save time in general, even with full length amplicons, by avoiding the separate spoa centering of the forward and reverse clusters just to be merged afterwards by --rc_identity_threshold).

edgardomortiz avatar Jun 16 '20 12:06 edgardomortiz

Implementing this during initial clustering is quite an undertaking. However, would it be sufficient for your purpose to do semi-global alignment in the reverse complement step and check the identity based on this semi-global alignment? This should have the same outcome any is much easier to implement.

ksahlin avatar Jun 17 '20 07:06 ksahlin

Yes, that would be awesome!

edgardomortiz avatar Jun 17 '20 07:06 edgardomortiz

To comment further on this, it is actually nice to have an option to fuse clusters which are highly identical when comparing seq1 sense vs seq2 sense rather then only comparing seq1 and reverse complement seq2 in the detect_reverse_complement script in the consensus.py module.

I was getting separate clusters because the primer regions might be ambiguous due to low coverage and low sequencing quality/higher degree of mutations. Thus, I would get multiple clusters which were highly similar when comparing sense vs sense (95%+ identity, only differing in primer regions). This is something which is easy to add (I did it a bit ugly here but it works):

def detect_reverse_complements(centers, rc_identity_threshold):
    filtered_centers = []
    already_removed = set()
    for i, (nr_reads_in_cl, c_id, seq, reads_path) in enumerate(centers):
        if type(reads_path) != list:
            all_reads = [reads_path]
        else:
            all_reads = reads_path

        merged_cluster_id = c_id
        merged_nr_reads = nr_reads_in_cl
        if c_id in already_removed:
            print("has already been merged, skipping")
            continue

        elif i == len(centers) - 1: # last sequence and it is not in already removed
            filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads ] )

        else:
            for j, (nr_reads_in_cl2, c_id2, seq2, reads_path) in enumerate(centers[i+1:]):
                #check for reverse complement matches
                seq2_rc = reverse_complement(seq2)
                seq_aln_rc, seq2_rc_aln, cigar_string_rc, cigar_tuples_rc, alignment_score_rc = parasail_alignment(seq, seq2_rc)
                seq_aln, seq2_aln, cigar_string, cigar_tuples, alignment_score = parasail_alignment(seq, seq2)

                #Determine alignment identity between complement seq1 and reverse complement seq2
                nr_mismatching_pos_rc = len([1 for n1, n2 in zip(seq_aln_rc, seq2_rc_aln) if n1 != n2])
                total_pos_rc = len(seq_aln_rc)
                aln_identity_rc =  (total_pos_rc - nr_mismatching_pos_rc) / float(total_pos_rc)
                
                #Determine alignment identity between complement seq1 and complement seq2
                nr_mismatching_pos = len([1 for n1, n2 in zip(seq_aln, seq2_aln) if n1 != n2])
                total_pos = len(seq_aln)
                aln_identity =  (total_pos - nr_mismatching_pos) / float(total_pos)

                print("Complement and Reverse Complement Identity seq 1 and 2: " + str(aln_identity_rc))
                print("Complement and Complement Identity seq 1 and 2: " + str(aln_identity))

                    if aln_identity_rc >= rc_identity_threshold:
                    print("Detected alignment identidy above threshold for complement vs reverse complement. Keeping center with the most read support and adding rc reads to supporting reads.")
                    merged_nr_reads += nr_reads_in_cl2
                    already_removed.add(c_id2)

                    if type(reads_path) != list:
                        all_reads.append(reads_path)
                    else:
                        for rp in reads_path:
                            all_reads.append(rp)
                
                if aln_identity >= rc_identity_threshold:
                    print("Detected alignment identidy above threshold for complement vs complement. Keeping center with the most read support and adding rc reads to supporting reads.")
                    merged_nr_reads += nr_reads_in_cl2
                    already_removed.add(c_id2)

                    if type(reads_path) != list:
                        all_reads.append(reads_path)
                    else:
                        for rp in reads_path:
                            all_reads.append(rp)

            filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads] )

    print(len(filtered_centers), "consensus formed.")
    return filtered_centers

2tony2 avatar Jun 03 '21 13:06 2tony2