
improve html representation of datasets

Open h-mayorquin opened this issue 1 year ago • 9 comments

Motivation

Improve the display of data in the HTML representation of containers. Note that this PR is focused on datasets that have already been written to disk. For the in-memory representation, the question of what to do with data wrapped in an iterator or a DataIO subtype can be addressed in another PR, I think.

How to test the behavior?

HDF5

I have been using this script

from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb.testing.mock.ophys import mock_ImagingPlane, mock_TwoPhotonSeries

import numpy as np

data = np.random.rand(500_000, 384)
timestamps = np.arange(500_000)
# wrap the data in H5DataIO to enable compression and chunking on write
data = H5DataIO(data=data, compression=True, chunks=True)

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(data=data, nwbfile=nwbfile, rate=None, timestamps=timestamps)

imaging_plane = mock_ImagingPlane(grid_spacing=[1.0, 1.0], nwbfile=nwbfile)


data = H5DataIO(data=np.random.rand(2, 2, 2), compression=True, chunks=True)
two_photon_series = mock_TwoPhotonSeries(name="TwoPhotonSeries", imaging_plane=imaging_plane, data=data, nwbfile=nwbfile)


# Write to file
from pynwb import NWBHDF5IO
with NWBHDF5IO('ecephys_tutorial.nwb', 'w') as io:
    io.write(nwbfile)



from pynwb import NWBHDF5IO

io = NWBHDF5IO('ecephys_tutorial.nwb', 'r')
nwbfile = io.read()
nwbfile

[screenshot: HTML representation of the file read back from HDF5]

Zarr

import os

import numpy as np
from numcodecs import Blosc, Delta

from hdmf_zarr import ZarrDataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile

filters = [Delta(dtype="i4")]

data_with_zarr_data_io = ZarrDataIO(
    data=np.arange(100000000, dtype='i4').reshape(10000, 10000),
    chunks=(1000, 1000),
    compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE),
    # filters=filters,
)

timestamps = np.arange(10000)

data = data_with_zarr_data_io

nwbfile = mock_NWBFile()
electrical_series_name = "ElectricalSeries"
rate = None
electrical_series = mock_ElectricalSeries(name=electrical_series_name, data=data, nwbfile=nwbfile, timestamps=timestamps, rate=None)


path = "zarr_test.nwb.zarr"
absolute_path = os.path.abspath(path)
with NWBZarrIO(path=path, mode="w") as io:
    io.write(nwbfile)
    
from hdmf_zarr.nwb import NWBZarrIO

path = "zarr_test.nwb.zarr"

io = NWBZarrIO(path=path, mode="r")
nwbfile = io.read()
nwbfile


[screenshot: HTML representation of the file read back from Zarr]

Checklist

  • [x] Did you update CHANGELOG.md with your changes?
  • [x] Does the PR clearly describe the problem and the solution?
  • [x] Have you reviewed our Contributing Guide?
  • [x] Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

h-mayorquin avatar Apr 19 '24 16:04 h-mayorquin

Codecov Report

Attention: Patch coverage is 87.50000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 89.12%. Comparing base (06a62b9) to head (01f8f8f). Report is 31 commits behind head on dev.

Files with missing lines | Patch % | Lines
src/hdmf/utils.py | 87.50% | 2 Missing and 2 partials
src/hdmf/backends/hdf5/h5tools.py | 84.61% | 1 Missing and 1 partial
src/hdmf/container.py | 84.61% | 1 Missing and 1 partial
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1100      +/-   ##
==========================================
+ Coverage   89.08%   89.12%   +0.03%     
==========================================
  Files          45       45              
  Lines        9890     9944      +54     
  Branches     2816     2825       +9     
==========================================
+ Hits         8811     8863      +52     
+ Misses        763      762       -1     
- Partials      316      319       +3     


codecov[bot] avatar Apr 23 '24 15:04 codecov[bot]

OK, I added table formatting for HDF5:

[screenshot: table-formatted HTML representation of an HDF5 dataset]

h-mayorquin avatar Apr 26 '24 23:04 h-mayorquin

@stephprince Concerning the test, yes, I can do it, but can you help me create a container that contains array data? I just don't have experience with the bare-bones object. This is my attempt:

import numpy as np

from hdmf.container import Container

container = Container(name="Container")
container.__fields__ = {
    "name": "data",
    "description": "test data",
}

test_data = np.array([1, 2, 3, 4, 5])
setattr(container, "data", test_data)
container.fields

But the data is not added as a field. How can I move forward?
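(For reference, a minimal sketch of one way to build a bare-bones container with an array field, assuming hdmf turns a class-level __fields__ tuple into properties whose values are stored in .fields; the class name here is made up for illustration:)

import numpy as np

from hdmf.container import Container


class ArrayContainer(Container):
    # declaring __fields__ on the class (not on an instance) lets hdmf's metaclass
    # generate a `data` property whose value is stored in container.fields
    __fields__ = ({"name": "data", "doc": "test data"},)

    def __init__(self, name, data=None):
        super().__init__(name=name)
        self.data = data


container = ArrayContainer(name="Container", data=np.array([1, 2, 3, 4, 5]))
print(container.fields)  # expected: {'data': array([1, 2, 3, 4, 5])}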

h-mayorquin avatar Apr 26 '24 23:04 h-mayorquin

Related:

https://github.com/hdmf-dev/hdmf-zarr/issues/186

h-mayorquin avatar Apr 29 '24 14:04 h-mayorquin

I added handling for division by zero. Check out what happens with external files (like Video):

[screenshot: HTML representation of a file containing external files]

From this example:

import remfile
import h5py

# `dandiset` is assumed to have been obtained beforehand (e.g. a dandiset object from the DANDI API client)
asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
recording_asset = dandiset.get_asset_by_path(path=asset_path)
url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
file_path = url

rfile = remfile.File(file_path)
file = h5py.File(rfile, 'r')

from pynwb import NWBHDF5IO

io = NWBHDF5IO(file=file, mode='r')

nwbfile = io.read()
nwbfile

h-mayorquin avatar Apr 30 '24 20:04 h-mayorquin

There are still some failing tests for different Python versions. It looks like one of the reasons is that h5py added the .nbytes attribute in version 3.0, and we still have h5py==2.10 as the minimum version.

>       array_size_in_bytes = array.nbytes
E       AttributeError: 'Dataset' object has no attribute 'nbytes'

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

stephprince avatar May 01 '24 17:05 stephprince

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

Checking hasattr(data, "nbytes") and only displaying the size when it is available seems reasonable to me. That way you can also avoid custom behavior depending on library versions.

oruebel avatar May 01 '24 17:05 oruebel

@stephprince

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

It can be estimated from the dtype and the number of elements. I will do that when the attribute does not exist.
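A sketch of that fallback (hypothetical helper name; it assumes the data object exposes shape and dtype, which h5py datasets do even before 3.0) could look like:

import numpy as np


def estimate_nbytes(array):
    # use the attribute when available (numpy, zarr, h5py >= 3.0)
    if hasattr(array, "nbytes"):
        return array.nbytes
    # otherwise estimate from the number of elements and the dtype item size
    return int(np.prod(array.shape)) * np.dtype(array.dtype).itemsize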

h-mayorquin avatar May 02 '24 15:05 h-mayorquin

@stephprince when you have time, can you review this?

rly avatar Oct 02 '24 20:10 rly

Rereading this discussion, I believe where we left off is that we want to remove the backend-specific logic from the Container class. To do so, it was proposed that:

In this PR we:

  • Add HDMFIO.generate_dataset_html(dataset) which would implement a minimalist representation
  • Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent an h5py.Dataset

In a separate PR on hdmf_zarr we would:

  • implement ZarrIO.generate_dataset_html(Zarr.array)

In the Container class, it would look like this:

read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk, so the dataset should be a numpy array or a list
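A rough sketch (not the code in this PR) of what a minimalist base HDMFIO.generate_dataset_html could look like:

class HDMFIO:  # sketch only; the real class lives in hdmf.backends.io
    @staticmethod
    def generate_dataset_html(dataset):
        """Minimalist fallback: show the type and, when available, shape and dtype."""
        rows = {
            "Data type": type(dataset).__name__,
            "Shape": getattr(dataset, "shape", None),
            "dtype": getattr(dataset, "dtype", None),
        }
        cells = "".join(f"<tr><th>{key}</th><td>{value}</td></tr>" for key, value in rows.items())
        return f"<table>{cells}</table>"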

@h-mayorquin did you want to do this? Otherwise I can go ahead and make the proposed changes to finish up this PR.

stephprince avatar Oct 17 '24 16:10 stephprince

Hi, @stephprince

I think this is a good summary.

I am not sure how to decouple HDF5IO.generate_dataset_html(h5py.Dataset) here, as hdmf seems tightly coupled to HDF5. Or is the idea that we only want to exclude Zarr?

This has been on the back of my mind for a while, but every time I had other priorities. It would be great if you have time to finish it.

h-mayorquin avatar Oct 17 '24 17:10 h-mayorquin

@h-mayorquin yes I can take a look at it and finish it up

stephprince avatar Oct 17 '24 21:10 stephprince

I have pushed the updates we discussed:

  • added utility functions generate_array_html_repr and get_basic_array_info to the utils module to get basic array info and generate an array HTML table
  • added a static HDMFIO.generate_dataset_html() method; the HDF5/Zarr implementations collect information from the dataset and then generate the actual HTML representation (a rough sketch of the HDF5 side is shown after this list)
  • updated Container._generate_array_html() to use these methods
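
(A rough sketch, not necessarily the exact code in the PR, of how the HDF5 side could collect h5py-specific details before delegating to the shared helper; the signature of get_basic_array_info is assumed to take the dataset and return a dict:)

import h5py

from hdmf.utils import generate_array_html_repr, get_basic_array_info


class HDF5IO:  # sketch only; the real class lives in hdmf.backends.hdf5.h5tools
    @staticmethod
    def generate_dataset_html(dataset: h5py.Dataset) -> str:
        """Collect h5py-specific info, then build the HTML table with the shared helper."""
        array_info = get_basic_array_info(dataset)  # assumed to return a dict of basic properties
        # add HDF5 storage details exposed by h5py.Dataset
        array_info.update({"Chunk shape": dataset.chunks, "Compression": dataset.compression})
        return generate_array_html_repr(array_info, dataset, "HDF5 dataset")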

I tested a Zarr implementation that looks like this and can submit a PR in hdmf_zarr for that:

from hdmf.utils import generate_array_html_repr

def generate_dataset_html(dataset):
    """Generates an html representation for a dataset for the ZarrIO class"""

    # get info from zarr array and generate html repr
    zarr_info_dict = dict(dataset.info_items())
    repr_html = generate_array_html_repr(zarr_info_dict, dataset, "Zarr Array")

    return repr_html

@oruebel @h-mayorquin if you could please review and let me know if there are any remaining concerns

stephprince avatar Oct 30 '24 21:10 stephprince

Looks good to me, thanks for taking on this.

h-mayorquin avatar Oct 31 '24 21:10 h-mayorquin