
Refactor GenomicRegionSet IO handling

Open fabio-t opened this issue 8 years ago • 2 comments

IO read/write functions should be separated from the actual GRS. This means extracting all the read_bed, write_bed, etc. functions and moving them into Format classes that take a GRS as input and either populate it or write it to file, as needed.

The following basic classes should be developed, at least:

  • BedFormat: it's the current "default" for GRS. The two are strongly coupled, which makes it harder to export to different formats. This refactoring will solve this problem.

  • BigBedFormat: it's currently only supported in some of the tools, in a "handcrafted" way. We need a more rational approach for this, especially to support further improvements like a disk-backed GRS that doesn't load everything in memory. This would greatly reduce the memory footprint of certain tools (e.g., motif analysis).

  • Bed12Format: a more complicated "bed-like" format that is relevant, I believe, only for RGT-Viz.
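
A minimal sketch of the separation proposed above, not the actual RGT code: `Region` and `RegionSet` are hypothetical stand-ins for GenomicRegion and GenomicRegionSet, and the `read`/`write` method names on `BedFormat` are assumptions made for illustration. The point is only that the container knows nothing about file formats, while the format class takes a set and either fills it or dumps it.

```python
import io


class Region:
    """Hypothetical stand-in for GenomicRegion."""

    def __init__(self, chrom, start, end, name=None):
        self.chrom = chrom
        self.start = start
        self.end = end
        self.name = name


class RegionSet:
    """Hypothetical stand-in for GenomicRegionSet: a container with no IO code."""

    def __init__(self, name):
        self.name = name
        self.regions = []

    def add(self, region):
        self.regions.append(region)


class BedFormat:
    """All BED-specific parsing and serialisation lives here (assumed API)."""

    @staticmethod
    def read(handle, region_set):
        # Populate the given set from an open, tab-separated BED file handle.
        for line in handle:
            line = line.strip()
            if not line or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines, comments and header lines
            fields = line.split("\t")
            name = fields[3] if len(fields) > 3 else None
            region_set.add(Region(fields[0], int(fields[1]), int(fields[2]), name))
        return region_set

    @staticmethod
    def write(handle, region_set):
        # Write every region of the set as one BED line.
        for region in region_set.regions:
            fields = [region.chrom, str(region.start), str(region.end)]
            if region.name is not None:
                fields.append(region.name)
            handle.write("\t".join(fields) + "\n")
```

With this layout, a BigBedFormat or Bed12Format would be another class exposing the same two methods, and the container itself would not need to change.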

To leave for later: improve memory footprint of GenomicRegion so that GRS can be much bigger. ~~Also possibly substitute the internal list for a proper array, to make removal O(1).~~

fabio-t avatar Jul 28 '17 11:07 fabio-t

As @jovesus pointed out, when a GRS is filled from a bed file it's always sorted. Also, duplicate lines are always kept. Both behaviours were already there before, but I'm not sure they should stay like this.

In general, a GenomicRegionSet is not really a Set. It's a wrapper around a List, with List semantics. Just a little quirk.
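
To make the quirk concrete, here is a small illustration using plain Python tuples rather than the real GRS class. The `(chrom, start, end)` tuples and the lexicographic sort are assumptions for the example; the actual ordering RGT applies may differ.

```python
# Hypothetical BED lines reduced to (chrom, start, end) tuples.
lines = [("chr2", 5, 8), ("chr1", 10, 20), ("chr1", 10, 20)]

# List semantics, as described above: sorted on load, duplicates kept.
as_list = sorted(lines)

# What real Set semantics would give instead: duplicates collapse.
as_set = set(lines)

print(len(as_list), len(as_set))
```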

fabio-t avatar Aug 04 '17 14:08 fabio-t

The basic idea is done. BigBed is still not supported, since we have to decide how to handle it. There are various options:

  • Simple conversion. I already have utility methods in motif analysis to convert from bed to bigbed and vice versa. Every application should know how to change the "score" field to make it fit the 0-1000 range, depending on the meaning that score has. This is simple but yields no advantage.

  • Make a GRSFileIO.BigBed. Instead of converting bed to bigbed and vice versa, this would write to and read from BigBed files directly. It has the advantage that it forces us to stop using the Bed utilities (or to write a python wrapper), and it should be more efficient than creating temporary BED files.

  • Keep a BigBed behind the GRS. This is a significant change and I'm not sure it's worth it. We would gain a lot by improving the memory efficiency of the GenomicRegion, instead of essentially writing a DB layer on top of the BigBeds.
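
For the score issue mentioned in the first option, one possible mapping is a linear min-max rescale into the 0-1000 range. This is only a sketch under that assumption: `rescale_scores` is a hypothetical helper, not an existing RGT function, and an application whose scores are, say, p-values would likely want a different transform.

```python
def rescale_scores(scores, max_score=1000):
    """Linearly map arbitrary numeric scores onto integers in [0, max_score]."""
    scores = list(scores)
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: the choice of max_score here is arbitrary.
        return [max_score] * len(scores)
    span = float(hi - lo)
    return [int(round((s - lo) / span * max_score)) for s in scores]
```

The mapping is lossy, so converting back from BigBed will not recover the original values unless the application also stores `lo` and `hi` somewhere.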

fabio-t avatar Aug 04 '17 16:08 fabio-t