Ignore low-complexity sequences

Open yiluyucheng opened this issue 10 months ago • 2 comments

Hi, thank you for providing such a useful tool! I was wondering if it's possible to disable reporting tandem repeats originating from low-complexity sequences, such as the following examples:

seq1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA seq2 GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT

yiluyucheng, Jul 01 '25 15:07

This might be possible with dust or sdust. Do you have a definition or requirement for low-complexity?

yangao07, Jul 03 '25 13:07

Thanks for the reply. I think Shannon entropy would be a good measure of complexity. For example, a sequence with Shannon entropy below 1.5 could be defined as low complexity.
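
To make the proposal concrete, here is a minimal sketch of such a filter. It assumes the entropy is measured in bits over single-base frequencies (so the maximum for DNA is 2.0) and that it is applied to each reported consensus sequence as a post-processing step. The names `shannon_entropy` and `is_low_complexity` are made up for illustration and are not part of TideHunter.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per base, from single-base frequencies."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    # H = sum over bases of p * log2(1 / p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def is_low_complexity(seq: str, threshold: float = 1.5) -> bool:
    """True if the entropy is below the threshold (1.5 is only the value suggested above)."""
    return shannon_entropy(seq) < threshold


if __name__ == "__main__":
    # Illustrative inputs shaped like the two examples in this issue
    for name, seq in [("polyA", "A" * 40), ("GT-repeat", "GT" * 20), ("mixed", "ACGT" * 10)]:
        print(name, round(shannon_entropy(seq), 3), is_low_complexity(seq))
```

With this definition a homopolymer scores 0 and a pure GT dinucleotide repeat scores 1.0, so both examples above would fall under the 1.5 cutoff. One caveat: single-base entropy only looks at composition, so a repeat that uses all four bases evenly would score 2.0 regardless of its unit length.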

yiluyucheng, Jul 04 '25 10:07