Do the Ns that RepeatMasker generates when hard masking repeated sequences affect blastn?

Open helloworldABCD1234 opened this issue 1 year ago • 1 comments

If RepeatMasker uses hard masking and replaces repeated sequences with Ns, will those Ns affect blastn? For example, is ATCGGGCTNNTTT treated as the same sequence as ATCGGGCTTTT? Or do ATCGGGCTNNNNTTT and ATCGGGCTTTTT have the same effect when given to blastn as input?

helloworldABCD1234 avatar Jun 14 '24 10:06 helloworldABCD1234

This is really a question about scoring matrices and gap parameters more than about rmblastn. RepeatMasker uses scoring matrices in which a substitution from N to any other base is slightly penalized (-1). As a result, bases align readily to short runs of Ns where those runs correctly span two non-N strings, while a run of Ns that is too long terminates the alignment (perhaps generating another alignment for the non-N sequence following it). The gap open/extension penalties also play into this: they are much higher than the N substitution penalty, so an alignment will not often span the Ns with a gap.

For example, if I use your example and an absurdly low cutoff score, I get the following with a similar matrix and gap parameters:

72 0.00 0.00 0.00 t3 1 11 (0) t1 1 11 (2)

  t3                     1 ATCGGGCTTTT 11
                                   ?? 
  t1                     1 ATCGGGCTNNT 11
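
To make the trade-off concrete, here is a minimal Python sketch of the arithmetic. Only the N penalty (-1) comes from the explanation above; the match, mismatch, and gap values are hypothetical placeholders, not the actual RepeatMasker matrix, so the numbers will not reproduce the score of 72 shown in the output. The gap model (one open cost plus an extension cost per additional position) and the `worth_spanning` check are also simplifications; a real aligner uses drop-off thresholds rather than a simple sign test.

```python
# Illustrative scores. Only N_PENALTY is taken from the discussion above;
# every other value is an assumption chosen for the example.
MATCH = 9         # hypothetical reward for identical bases
MISMATCH = -15    # hypothetical penalty for differing non-N bases
N_PENALTY = -1    # N against any base, as described above
GAP_OPEN = -30    # hypothetical cost of the first gapped position
GAP_EXTEND = -5   # hypothetical cost of each additional gapped position


def pair_score(a: str, b: str) -> int:
    """Score one aligned column."""
    if a == "N" or b == "N":
        return N_PENALTY
    return MATCH if a == b else MISMATCH


def ungapped_score(query: str, subject: str) -> int:
    """Score two equal-length sequences aligned column by column."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


def gap_cost(length: int) -> int:
    """Cost of a single gap covering `length` positions."""
    return GAP_OPEN + GAP_EXTEND * (length - 1)


def worth_spanning(n_run: int, flank_matches: int) -> bool:
    """Simplified check: does the matching flank beyond a run of Ns
    gain more than aligning through the run costs?"""
    return flank_matches * MATCH + n_run * N_PENALTY > 0


if __name__ == "__main__":
    # Aligning straight through the two Ns costs only 2 points.
    print(ungapped_score("ATCGGGCTTTT", "ATCGGGCTNNT"))
    # Skipping the same two positions with a gap is far more expensive.
    print(gap_cost(2))
    # A short run is worth crossing; a long run outweighs a short flank.
    print(worth_spanning(2, 1), worth_spanning(100, 9))
```

With these placeholder values, crossing two Ns costs 2 points while a two-position gap costs 35, which is why the alignment above runs through the Ns instead of gapping over them.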

Does that answer your question?

rmhubley avatar Jun 14 '24 17:06 rmhubley