pp-sketchlib
HDF5 file needs to be split into multiple groups
Once the sketch group holds more than about 500k sketches, performance gets very slow. It looks like this is because the metadata cache size isn't large enough: https://forum.hdfgroup.org/t/limit-on-the-number-of-datasets-in-one-group/5892
I think the solution will be to make subgroups 'sketch1', 'sketch2', etc., each holding a fixed-size block of sketches, say 30k. It just needs a bit of care to make sure it's all backwards compatible.
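A rough sketch of what the blocking and a backwards-compatible lookup might look like, in Python. This is only an illustration, not how pp-sketchlib does it: `block_group_name` and `find_sketch` are hypothetical helper names, the 30k block size is just the figure floated above, and `root` is assumed to be anything that maps names to children (an h5py `Group` or a plain nested dict both behave this way).

```python
import re

# Block size floated in this issue; an assumption, not a tuned value.
BLOCK_SIZE = 30_000

# Subgroup names of the proposed layout: 'sketch1', 'sketch2', ...
_BLOCK_RE = re.compile(r"sketch\d+")


def block_group_name(index, block_size=BLOCK_SIZE):
    """Return the subgroup name for the sketch at 0-based position `index`."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return f"sketch{index // block_size + 1}"


def find_sketch(root, name):
    """Look up a sketch by name in either the flat or the blocked layout.

    `root` is any mapping-like group (hypothetically the top-level sketch
    group). Returns None if the sketch is not found.
    """
    # Old flat layout: the sketch sits directly under the top-level group.
    if name in root:
        return root[name]
    # New blocked layout: scan the 'sketchN' subgroups for the name.
    for key in root:
        if _BLOCK_RE.fullmatch(key) and name in root[key]:
            return root[key][name]
    return None


# Usage with plain dicts standing in for HDF5 groups.
legacy = {"sampleA": "sketch data A"}
blocked = {"sketch1": {"sampleA": "sketch data A"},
           "sketch2": {"sampleB": "sketch data B"}}
print(block_group_name(30_000))
print(find_sketch(legacy, "sampleA"), find_sketch(blocked, "sampleB"))
```

One thing this glosses over: a sample that is itself named like 'sketch1' would collide with a block subgroup, so a real implementation would probably want a less ambiguous prefix or a layout version attribute on the file.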
I'm wondering if switching to Apache Arrow at some point might solve this and #37.