question on overlapping intervals in bed file
Hello -
Great package! Using default parameters, I noticed that the bed file produced has some intervals overlapping others. Is there a way to avoid this? Example from the first two lines here:
AW_ScYP8k3204HRSCAF325_RagTag 0 902 AAAAAGGGGG
AW_ScYP8k3204HRSCAF325_RagTag 883 1614 AAAAAAAGGGAAAAAAGGGG
Thanks for any advice.
Hello,
Thank you for using ULTRA and reaching out! We generally try to avoid overlapping annotations where possible, although there are a handful of edge cases. Here, the change in pattern is fairly subtle, and at the boundary between the two repeats it is difficult to say precisely where one ends and the other begins. For larger-period repeats especially, labeling these fuzzy transitions as exactly one pattern or exactly the other would be arbitrary and could mislead users. In these fuzzy edge cases we allow the later repeat to overlap the earlier repeat by up to one repetitive unit.
If you require strictly non-overlapping repeat annotations, I can write a patch that enables that functionality, although I would need to know a bit more about your use case and exactly how you would like this sort of overlapping annotation to be resolved.
Thanks, Daniel
Thanks very much Daniel for responding! I think the main reason to have non-overlapping bed files might be if one were trying to estimate how much of a scaffold is tandemly 'repetitive' in the broadest sense, in the same way that non-overlapping bed files would be useful for any sort of annotation, enrichment search, or similar analysis. It would be useful to see how arbitrarily resolving long repeats might change the results of such a summary - or maybe you have an example in your paper. My guess is that it would not be too catastrophic. For example, I just calculated that across my 1.3 Gb bird genome, ULTRA finds a total of ~89 Mb of tandem repeat, and the bed file records ~1.4 Mb of overlap between annotated repeats within contigs/scaffolds. That's not much, so arbitrarily resolving the overlaps may not make much of a difference. But I am sure it depends on the individual genome, etc. For now I am simply getting a feel for how much overlap there is so that I can put my results in context. It's quite fascinating though - ULTRA finds some tandem repeats that RepeatMasker and Satellite Repeat Finder miss, so it's an important addition to our toolkit. I will place a different but related query in a separate box to follow.
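For anyone wanting to make the same comparison, here is a minimal Python sketch of one way to estimate the overlap. It assumes whitespace-delimited columns of sequence name, start, and end, with half-open coordinates as in the two example lines above. The function names are my own for illustration and are not part of ULTRA.

```python
from collections import defaultdict

def parse_bed(lines):
    """Group (start, end) intervals by sequence name (first three columns)."""
    by_seq = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        by_seq[fields[0]].append((int(fields[1]), int(fields[2])))
    return by_seq

def coverage_summary(by_seq):
    """Return (sum of interval lengths, bases covered at least once)."""
    summed = covered = 0
    for intervals in by_seq.values():
        cur_start = cur_end = None
        for start, end in sorted(intervals):
            summed += end - start
            if cur_end is None or start > cur_end:
                # Gap before this interval: close the current merged block.
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
        if cur_end is not None:
            covered += cur_end - cur_start
    return summed, covered

example = [
    "AW_ScYP8k3204HRSCAF325_RagTag 0 902 AAAAAGGGGG",
    "AW_ScYP8k3204HRSCAF325_RagTag 883 1614 AAAAAAAGGGAAAAAAGGGG",
]
summed, covered = coverage_summary(parse_bed(example))
print(summed, covered, summed - covered)
```

The difference between the two numbers is the overlap. A base covered by three annotations is counted twice in that difference, so it measures redundant annotation rather than the number of distinct overlapped positions.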
In earlier ULTRA development I found that there were more diabolical edge cases than I would like, especially when using ULTRA with a higher maximum repetitive period. I have a good solution in mind that I believe will fix the overlap problem without introducing truncation problems, although this solution would require more work than I am able to put into ULTRA at the moment.
I do see value in having output with zero overlap, though, so I believe I'll add a --trim_overlaps flag that uses a simpler approach to resolve overlapping annotations. It won't be perfect, but it may still be helpful. I'll leave this issue open until the flag is live in the repo.
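As a rough illustration of what one simple resolution could look like, the Python sketch below clips the start of each later repeat to the end of the previous one. This is only an assumption about one possible policy, not how ULTRA does or will implement the flag, and the function here is a hypothetical helper rather than ULTRA code.

```python
def trim_overlaps(intervals):
    """Clip each interval's start to the end of the previously kept interval.

    intervals: iterable of (start, end, pattern) tuples on ONE sequence,
    using half-open coordinates. Intervals left empty after clipping
    (fully contained in an earlier one) are dropped.
    """
    trimmed = []
    prev_end = 0
    for start, end, pattern in sorted(intervals):
        start = max(start, prev_end)  # give the contested bases to the earlier repeat
        if start >= end:
            continue
        trimmed.append((start, end, pattern))
        prev_end = end
    return trimmed

records = [
    (0, 902, "AAAAAGGGGG"),
    (883, 1614, "AAAAAAAGGGAAAAAAGGGG"),
]
print(trim_overlaps(records))
```

Assigning the shared bases to the earlier repeat is arbitrary. Splitting the overlap at its midpoint, or favoring the longer repeat, would be equally defensible for a coverage summary.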
Great, thanks Daniel - very helpful (and no rush!)