
Unable to read dataset larger than Integer.MAX_VALUE bytes

Open · jbellis opened this issue 1 year ago · 1 comment

Exception: Failed to map data buffer for dataset '/train'
        at org.example.Texmex.lambda$main$3(Texmex.java:110)
        at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: io.jhdf.exceptions.HdfException: Failed to map data buffer for dataset '/train'
        at io.jhdf.dataset.ContiguousDatasetImpl.getDataBuffer(ContiguousDatasetImpl.java:44)
        at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:133)
        at org.example.Texmex.computeRecallFor(Texmex.java:70)
        at org.example.Texmex.lambda$main$3(Texmex.java:108)
        ... 1 more
Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1185)
        at io.jhdf.storage.HdfFileChannel.mapNoOffset(HdfFileChannel.java:74)
        at io.jhdf.storage.HdfFileChannel.map(HdfFileChannel.java:66)
        at io.jhdf.dataset.ContiguousDatasetImpl.getDataBuffer(ContiguousDatasetImpl.java:40)

The dataset in question is 3848008288 bytes (http://ann-benchmarks.com/deep-image-96-angular.hdf5).

jbellis avatar Jul 30 '23 21:07 jbellis

Thanks for raising this and providing a sample file. This is currently a limitation with contiguous datasets. It would be possible to split the mapping up and read contiguous datasets more like chunked datasets are read; in theory this would also be a nice way to parallelise the reading and gain performance.
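The split-mapping idea can be sketched with plain NIO (this is not jhdf code, just an illustration of the approach, with hypothetical names): `FileChannel.map` rejects regions larger than `Integer.MAX_VALUE` bytes, so a bigger region has to be covered by a list of smaller mappings.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMapping {

	// Map [offset, offset + length) as a list of buffers, each at most maxChunk bytes.
	// In real use maxChunk would be Integer.MAX_VALUE; here it is a parameter for the demo.
	static List<MappedByteBuffer> mapInChunks(FileChannel channel, long offset, long length, long maxChunk)
			throws IOException {
		List<MappedByteBuffer> buffers = new ArrayList<>();
		long position = offset;
		long remaining = length;
		while (remaining > 0) {
			long size = Math.min(remaining, maxChunk);
			buffers.add(channel.map(FileChannel.MapMode.READ_ONLY, position, size));
			position += size;
			remaining -= size;
		}
		return buffers;
	}

	public static void main(String[] args) throws IOException {
		// Small demo file; a real dataset would be mapped with maxChunk = Integer.MAX_VALUE.
		Path file = Files.createTempFile("demo", ".bin");
		file.toFile().deleteOnExit();
		Files.write(file, new byte[1000]);
		try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
			List<MappedByteBuffer> buffers = mapInChunks(channel, 0, 1000, 256);
			System.out.println(buffers.size()); // 256 + 256 + 256 + 232 bytes -> prints 4
		}
	}
}
```

Reading would then iterate over the buffer list, switching to the next buffer when the current one is exhausted; writing back through the buffers is not needed for read-only access.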

In the meantime you could try slice reading, using Dataset#getData(long[] sliceOffset, int[] sliceDimensions).

Some code like this seems to work (it is definitely not optimal; it takes about 30 seconds to read on my system):

import io.jhdf.HdfFile;
import io.jhdf.api.Dataset;

import java.lang.reflect.Array;
import java.nio.file.Paths;

public class ReadDataset {
	public static void main(String[] args) {
		try (HdfFile hdfFile = new HdfFile(Paths.get("/path/to/deep-image-96-angular.hdf5"))) {
			Dataset dataset = hdfFile.getDatasetByPath("/train");
			int[] dimensions = dataset.getDimensions();
			float[][] data = (float[][]) Array.newInstance(dataset.getJavaType(), dimensions);
			// Read one row at a time via the slice API, so each read stays well under Integer.MAX_VALUE bytes
			for (int i = 0; i < dimensions[0]; i++) {
				data[i] = ((float[][]) dataset.getData(new long[]{i, 0}, new int[]{1, dimensions[1]}))[0];
			}
			System.out.println("Finished read");
		}
	}
}

jamesmudd avatar Aug 02 '23 09:08 jamesmudd