reg-gen
GenomicRegionDict class
Currently a GenomicRegionSet (GRS) holds a flat list of GenomicRegions. However, the regions often fall into natural subsets already, the most common example being chromosomes. If I'm trying to intersect MPBS regions with, say, extended CpG regions, I should only compare regions on the same chromosome. This could be a considerable speedup.
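To make the expected saving concrete, here is a minimal sketch of the bucketing idea. It uses plain `(chrom, start, end)` tuples as a hypothetical stand-in for GenomicRegion objects, and `group_by_chrom` is an illustrative helper, not part of the existing code base.

```python
from collections import defaultdict


def group_by_chrom(regions):
    """Bucket (chrom, start, end) tuples by chromosome name."""
    buckets = defaultdict(list)
    for chrom, start, end in regions:
        buckets[chrom].append((start, end))
    return dict(buckets)


# Hypothetical toy data standing in for MPBS and extended CpG regions.
mpbs = [("chr1", 100, 120), ("chr2", 50, 70), ("chr1", 300, 320)]
cpg = [("chr2", 60, 200), ("chr1", 110, 400), ("chr3", 0, 10)]

mpbs_by_chrom = group_by_chrom(mpbs)
cpg_by_chrom = group_by_chrom(cpg)

# Only chromosomes present on both sides need any comparison at all.
shared = sorted(mpbs_by_chrom.keys() & cpg_by_chrom.keys())

# Number of pairwise comparisons without and with bucketing.
naive_pairs = len(mpbs) * len(cpg)
bucketed_pairs = sum(len(mpbs_by_chrom[c]) * len(cpg_by_chrom[c]) for c in shared)
```

On this toy input the all-against-all approach needs 9 comparisons and the bucketed one needs 3. How that translates to real data, especially against an already sorted GRS, is exactly what needs measuring.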
A GenomicRegionDict (GRD) would provide an API compatible with GenomicRegionSet (this may first require cleaning up the GRS API, which is long overdue), so that the two could ideally be exchanged transparently.
Example of intersection behaviour:
- If I try to intersect a GRD with a GRS (or vice versa, which requires a GRS change, but we can also block this case with an Exception), `intersect` should be called for every GRS within the GRD. The result should be a GRD in the first case (retaining only the keys with at least one intersecting region) and a GRS in the second. Alternative approaches can be evaluated.
- If I try to intersect a GRD with a GRD, common keys should be found and only the corresponding GRS should be intersected, pairwise. The result should be a GRD.
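The two cases above could look roughly like the sketch below. `SimpleGRD` and `intersect_regions` are hypothetical stand-ins: a GRS is modelled as a plain list of `(chrom, start, end)` tuples, and the pure-Python overlap loop takes the place of the real `intersect`. Dropping empty keys in the GRD-with-GRD case is an assumption as well; the description only requires it for GRD-with-GRS.

```python
def intersect_regions(a, b):
    """Pairwise overlaps between two lists of half-open (chrom, start, end) tuples."""
    out = []
    for c1, s1, e1 in a:
        for c2, s2, e2 in b:
            if c1 == c2:
                s, e = max(s1, s2), min(e1, e2)
                if s < e:
                    out.append((c1, s, e))
    return sorted(out)


class SimpleGRD:
    """Hypothetical stand-in: maps a key (e.g. a chromosome) to a list of regions."""

    def __init__(self, data=None):
        self.data = {k: list(v) for k, v in (data or {}).items()}

    def intersect(self, other):
        if isinstance(other, SimpleGRD):
            # GRD x GRD: intersect only the region lists stored under common keys.
            common = self.data.keys() & other.data.keys()
            pairs = {k: (self.data[k], other.data[k]) for k in common}
        else:
            # GRD x GRS: intersect every region list in the dict with the flat set.
            pairs = {k: (v, other) for k, v in self.data.items()}
        result = {k: intersect_regions(a, b) for k, (a, b) in pairs.items()}
        # Keep only keys with at least one intersecting region.
        return SimpleGRD({k: v for k, v in result.items() if v})


grd_a = SimpleGRD({"chr1": [("chr1", 100, 200)], "chr2": [("chr2", 50, 80)]})
grd_b = SimpleGRD({"chr1": [("chr1", 150, 300)], "chr3": [("chr3", 0, 10)]})
grs = [("chr2", 60, 70), ("chr2", 75, 90)]

dict_result = grd_a.intersect(grd_b).data  # {'chr1': [('chr1', 150, 200)]}
set_result = grd_a.intersect(grs).data     # {'chr2': [('chr2', 60, 70), ('chr2', 75, 80)]}
```

The reverse case (GRS with GRD returning a GRS) is left out here, since it needs a change on the GRS side.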
The following scenario should be tested:
- Set up two GRS, the second having multiple regions for each region of the first, in mixed order and across 20 chromosomes.
- Sort them both and calculate the intersection of each region of the first with all relevant regions of the second GRS. Note the time it takes to sort and to finish.
- Make a very simple version of GenomicRegionDict and use the above-mentioned logic (i.e., by chromosome) to do the same job. Try it both with the same sorting trick and without (e.g., checking all regions of the second GRD for each region of the first, in the correct chromosome of course).
If the above doesn't yield a significant advantage of the GRD over the GRS, it's not worth implementing. Smart use of list comprehensions and of the C-based intersect function should be applied.
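A possible starting point for the test scenario is sketched below. Everything in it is an assumption made for illustration: regions are `(chrom, start, end)` tuples, the data is random, the region counts and lengths are arbitrary, and the overlap counting is pure Python rather than the C-based intersect function, so the absolute timings will not carry over to the real classes.

```python
import random
import time
from collections import defaultdict


def make_regions(n_per_chrom, n_chroms=20, length=50, span=100_000, seed=0):
    """Random (chrom, start, end) tuples in mixed order (hypothetical test data)."""
    rng = random.Random(seed)
    regions = []
    for c in range(1, n_chroms + 1):
        for _ in range(n_per_chrom):
            start = rng.randrange(span)
            regions.append((f"chr{c}", start, start + length))
    rng.shuffle(regions)
    return regions


def flat_sorted(a, b):
    """Count overlapping pairs with both flat lists sorted by (chrom, start)."""
    a, b = sorted(a), sorted(b)
    count, lo = 0, 0
    for chrom, start, end in a:
        # Skip b regions that lie entirely before the current a region.
        while lo < len(b) and (b[lo][0] < chrom or (b[lo][0] == chrom and b[lo][2] <= start)):
            lo += 1
        j = lo
        while j < len(b) and b[j][0] == chrom and b[j][1] < end:
            if b[j][2] > start:
                count += 1
            j += 1
    return count


def by_chrom(regions):
    """Very simple dict-of-lists version of the region container."""
    d = defaultdict(list)
    for chrom, start, end in regions:
        d[chrom].append((start, end))
    return d


def dict_sorted(a, b):
    """Same sweep, but run independently inside each shared chromosome."""
    da, db = by_chrom(a), by_chrom(b)
    count = 0
    for chrom in da.keys() & db.keys():
        ra, rb = sorted(da[chrom]), sorted(db[chrom])
        lo = 0
        for start, end in ra:
            while lo < len(rb) and rb[lo][1] <= start:
                lo += 1
            j = lo
            while j < len(rb) and rb[j][0] < end:
                if rb[j][1] > start:
                    count += 1
                j += 1
    return count


def dict_brute(a, b):
    """No sorting: compare every pair, but only within the same chromosome."""
    da, db = by_chrom(a), by_chrom(b)
    return sum(
        1
        for chrom in da.keys() & db.keys()
        for s1, e1 in da[chrom]
        for s2, e2 in db[chrom]
        if s1 < e2 and s2 < e1
    )


first = make_regions(50, seed=1)
second = make_regions(250, seed=2)  # several regions per region of the first set

counts, timings = {}, {}
for fn in (flat_sorted, dict_sorted, dict_brute):
    t0 = time.perf_counter()
    counts[fn.__name__] = fn(first, second)
    timings[fn.__name__] = time.perf_counter() - t0
    print(f"{fn.__name__:12s} {counts[fn.__name__]:6d} overlaps  {timings[fn.__name__]:.4f}s")
```

All three variants should report the same number of overlaps, which doubles as a sanity check. The timings for `flat_sorted` and `dict_sorted` include the sorting step, as the scenario asks.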