Ignore low-complexity sequences
Hi, thank you for providing such a useful tool! I was wondering if it's possible to disable reporting tandem repeats originating from low-complexity sequences, such as the following examples:
seq1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA seq2 GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
This might be possible with dust or sdust. Do you have a definition or requirement for low-complexity?
Thanks for the reply. I think Shannon entropy would be a good measure of complexity. For example, a sequence with Shannon entropy below 1.5 could be defined as low complexity.
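To make the proposed criterion concrete, here is a minimal Python sketch of such a filter. It is not part of the tool. It assumes the entropy is measured in bits (log base 2) over single-nucleotide frequencies, so a DNA sequence scores between 0 and 2. The function names `shannon_entropy` and `is_low_complexity` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits, from single-character frequencies (assumption)."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq.upper())
    # Sum p * log2(1/p) over each observed character, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def is_low_complexity(seq: str, threshold: float = 1.5) -> bool:
    """Flag a sequence whose entropy falls below the threshold (1.5 as suggested above)."""
    return shannon_entropy(seq) < threshold


if __name__ == "__main__":
    for name, seq in [("homopolymer", "A" * 40), ("dinucleotide", "GT" * 20)]:
        print(name, round(shannon_entropy(seq), 3), is_low_complexity(seq))
```

Under these assumptions, a pure homopolymer like seq1 scores 0 and a GT repeat like seq2 scores about 1, so both fall below the 1.5 cutoff. Note that single-nucleotide entropy ignores base order, so a repeat that uses all four bases evenly would score close to 2 and would not be filtered.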