
Deleting bins from scool on append leaves dead space in file

Open bskubi opened this issue 11 months ago • 4 comments

In the HDF5 format, deleting groups/datasets does not free the corresponding space in the file.

I noticed that in the create_scool implementation, if the function is called in append mode, it always deletes the original bins group and recreates it, even when the submitted bins are identical. As a result, calling create_scool multiple times with the same bins but an empty pixels dict turns the previous bins datasets into inaccessible dead space, so the file grows larger and larger even though the groups and data inside it are unchanged.

bskubi avatar Jan 08 '25 23:01 bskubi

This issue arises from how HDF5 handles space when deleting or modifying groups and datasets. When you delete a group or dataset in an HDF5 file, the space those objects occupied is not automatically freed; it is marked as free space but remains part of the file. Repeated deletions and additions can therefore bloat the file, as you've observed with the create_scool function in append mode. Two mitigations:

- Use file compression: when adding new data, consider using compression to save space. Some HDF5 libraries support automatic compression, reducing file size even if some space is not reclaimed.
- Defragment: if file size is a significant concern and space isn't being freed, use a tool like h5repack (for HDF5 files) to compact the file and reclaim unused space.

srinitha709 avatar Jan 09 '25 09:01 srinitha709

Hi, I'm aware of this. It would be easy to avoid these issues for users of the method in the first place by adding a simple check: verify that the submitted bins are identical to the existing bins, and skip deleting and rewriting them if so. Otherwise, users who append to a file many times iteratively may end up needing to repack a large .scool file, which can be slow, and which this adjustment to create_scool would make unnecessary.
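The check could look something like this (a hypothetical helper, not part of the cooler API; the comparison strategy is an assumption):

```python
import pandas as pd

def bins_identical(existing: pd.DataFrame, submitted: pd.DataFrame) -> bool:
    """Return True if the submitted bin table matches the existing one,
    in which case create_scool could skip deleting and rewriting it.
    (Hypothetical helper; not part of the cooler API.)"""
    if list(existing.columns) != list(submitted.columns):
        return False
    return existing.reset_index(drop=True).equals(submitted.reset_index(drop=True))

bins = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 1000], "end": [1000, 2000]})
assert bins_identical(bins, bins.copy())
assert not bins_identical(bins, bins.assign(start=[0, 500]))
```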

I also think it would be useful if the create() method that create_scool calls did not validate that the bins dataframe contains the chrom, start, and end columns, since it does not actually use them. Otherwise, if bin columns besides chrom, start, and end are to be passed for a large collection of cells, redundant and unused copies of those three columns must be included in each per-cell dataframe, which can use substantial memory. The workaround I'm currently using is to set these columns to None for all but the first cell passed in; I think relaxing this check would make for a nicer interface.
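For concreteness, the workaround looks roughly like this (an illustration of the idea, not the cooler API; column names beyond chrom/start/end are made up):

```python
import pandas as pd

# Shared genomic bins for all cells.
shared_bins = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [0, 1000, 2000],
    "end":   [1000, 2000, 3000],
})

def per_cell_bins(cell_index: int, weights) -> pd.DataFrame:
    """Build the bins dataframe for one cell. Only the first cell carries real
    chrom/start/end values; later cells satisfy the column check with None
    placeholders. (Illustration of the workaround, not the cooler API.)"""
    df = pd.DataFrame({"weight": weights})
    if cell_index == 0:
        df["chrom"] = shared_bins["chrom"]
        df["start"] = shared_bins["start"]
        df["end"] = shared_bins["end"]
    else:
        df["chrom"] = None
        df["start"] = None
        df["end"] = None
    return df

first = per_cell_bins(0, [1.0, 0.9, 1.1])
later = per_cell_bins(1, [0.8, 1.2, 1.0])
assert list(first["chrom"]) == ["chr1", "chr1", "chr1"]
assert later["chrom"].isna().all()
```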

Just food for thought, since based on the design of the code I wasn't sure if you were aware of these somewhat hidden issues! Thanks.


bskubi avatar Jan 10 '25 00:01 bskubi

Thanks for raising this issue @bskubi.

If I understand correctly, you are calling create_scool multiple times on the same file in append mode to add new single cell maps to an existing dataset. That wasn't an anticipated usage pattern, so I see how this issue arose. The contributor who wrote create_scool encoded the input as a dictionary of cell names to pixel dataframes, fully materialized, which is not suitable for large datasets. Enabling the pattern you are using without deleting and recreating the bin table would be a workaround.

Additionally (or alternatively), I've long been meaning to allow an iterator as input, to support incremental appending of single cell maps. Would this kind of solution fit your workflow? Something like:

def contact_map_iter():
    for cell_id in cell_ids:
        pixels = ...  # fetch or generate pixels for this cell
        yield cell_id, pixels

create_scool("foo.scool", bins, contact_map_iter())

I also think it would be useful if the create() method that create_scool calls did not validate that the bins dataframe contains the chrom, start and end columns, since it does not actually use them.

Good point! Perhaps optional cell-specific bin columns could be rolled into the iterator solution.

nvictus avatar Jan 11 '25 09:01 nvictus

The major issue in the current design as I see it is actually with the bins argument, not the pixels argument. The pixels argument already allows passing an iterator that generates pixels dataframes on the fly for the associated cell.

By contrast, the bins dict requires all of its values to be pandas dataframes, so the bins info for the entire dataset must be loaded into memory at once, including any weights or other per-cell bins-associated values.

The change I think would be best is something like this:

Allow passing a single bins dataframe via the bins argument, setting the chrom/start/end value of the shared bins for all cells.

Then allow passing an iterator that generates (cell id, pixels iterator, cell-specific bins iterator) tuples. create_scool would iterate through the pixels iterator, appending to the per-cell pixels dataset, then iterate through the bins iterator, appending to the per-cell bins dataset.
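The proposed input shape can be sketched as follows (all names are illustrative; nothing here is the actual cooler API, and a real writer would append chunks to HDF5 rather than concatenate in memory):

```python
import pandas as pd

# Shared bins dataframe: one copy of chrom/start/end for all cells.
shared_bins = pd.DataFrame({
    "chrom": ["chr1", "chr1"], "start": [0, 1000], "end": [1000, 2000],
})

def cell_iter(cell_ids):
    """Yield (cell_id, pixels_chunks, cell_bins_chunks) for each cell.
    The inner iterators yield dataframe chunks, so nothing is fully
    materialized up front. (Hypothetical proposed interface.)"""
    for cell_id in cell_ids:
        pixels_chunks = iter([pd.DataFrame({"bin1_id": [0], "bin2_id": [1], "count": [3]})])
        bins_chunks = iter([pd.DataFrame({"weight": [1.0, 0.9]})])
        yield cell_id, pixels_chunks, bins_chunks

# A writer would consume it chunk by chunk, appending per-cell datasets.
written = {}
for cell_id, pixels_chunks, bins_chunks in cell_iter(["cell1", "cell2"]):
    written[cell_id] = (
        pd.concat(pixels_chunks, ignore_index=True),
        pd.concat(bins_chunks, ignore_index=True),
    )

assert set(written) == {"cell1", "cell2"}
assert written["cell1"][0]["count"].iloc[0] == 3
```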

Does that make sense?


bskubi avatar Jan 13 '25 05:01 bskubi