netcdf-c icon indicating copy to clipboard operation
netcdf-c copied to clipboard

NCZarr does not support reading of many zarr files

Open dzenanz opened this issue 2 years ago • 31 comments

I am trying to read sample zarr files provided here: https://github.com/ome/ome-ngff-prototypes#data-availability

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=nczarr: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)
dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4#mode=zarr,file: NetCDF: Malformed URL
dzenan@corista:~/tester$ ncdump -h https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <?xml^ version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>ngff-example-data/v0.4.dds</Key><BucketName>i2k-2020</BucketName><Resource>/i2k-2020/ngff-example-data/v0.4.dds</Resource><RequestId>17096CA8F2A3955C</RequestId><HostId>083e6bd6-5f67-4d7a-bf85-13f52b390395</HostId></Error>
ncdump: https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: file not found
dzenan@corista:~/tester$ ncdump -h s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3:https://s3.embl.de/i2k-2020/ngff-example-data/v0.4: No such file or directory
dzenan@corista:~/tester$ ncdump -h s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4
ncdump: s3://s3.embl.de/i2k-2020/ngff-example-data/v0.4: NetCDF: Attempt to use feature that was not turned on when netCDF was built.

Am I doing something wrong? I was under impression that NCZarr would read most zarr files. What are the offending features in these examples which break NCZarr?

Ubuntu 22.04.1 LTS, GCC 11.2.0, libnetcdf-dev Version: 1:4.8.1-1.

dzenanz avatar Aug 08 '22 18:08 dzenanz

@DennisHeimbigner do you have some insight?

dzenanz avatar Aug 10 '22 13:08 dzenanz

Sorry for the delay. It appears to be related to the use of '/' as the dimension separator. I was trying to track down exactly what i happening.

DennisHeimbigner avatar Aug 10 '22 17:08 DennisHeimbigner

I was wrong. The problem is this (its complicated).

  1. array "s0" meta-data specifies this:
    • shape is [4,930,1024]
    • _ARRAY_DIMENSIONS are ["c","y","x"]
  2. array "s1" meta-data specifies this:
    • shape is [4,465,512]
    • _ARRAY_DIMENSIONS are ["c","y","x"]

Since both are in the same group (the root group), this is an inconsistency because one variable says y=930 and the other says y=465. You might contact the Xarray community to see how this contradiction should be resolved.

DennisHeimbigner avatar Aug 10 '22 19:08 DennisHeimbigner

OME NGFF specifies that "Metadata about an image can be found under the "multiscales" key in the group-level metadata." As far as I can tell, multiple arrays are designed to have different dimensions to represent different levels of the image pyramid. @joshmoore Am I interpreting this correctly?

This might be contrary to NCZarr's constraints? Is there a way to reconcile this? Or is NCZarr a bad choice for Zarr-backend for NGFF implementation?

dzenanz avatar Aug 10 '22 19:08 dzenanz

I do not know anything about the OME standard, but why do you think it is generating legal Zarr Version 2 files, never mind NCZarr.

DennisHeimbigner avatar Aug 10 '22 19:08 DennisHeimbigner

Since both are in the same group (the root group), this is an inconsistency because one variable says y=930 and the other says y=465. You might contact the Xarray community to see how this contradiction should be resolved.

@DennisHeimbigner: but that should only be the case for mode=nczarr, no? There's no such restriction in Zarr itself.

As a side note, @dzenanz, https://github.com/ome/ngff/pull/114 is intended to play nicely with xarray and nczarr.

joshmoore avatar Aug 10 '22 19:08 joshmoore

It is actually an XArray problem since the _ARRAY_DIMENSIONS attribute is not recognized in Zarr V2 (or has that changed?). But it should be the case that the names assigned to shapes should be consistent across all uses of those dimensions names, correct?

DennisHeimbigner avatar Aug 10 '22 19:08 DennisHeimbigner

It is actually an XArray problem since the _ARRAY_DIMENSIONS attribute is not recognized in Zarr V2 (or has that changed?)

It's definitely true that Zarr v2 is agnostic to _ARRAY_DIMENIONS, but why should that impact mode=zarr filesets?

But it should be the case that the names assigned to shapes should be consistent across all uses of those dimensions names, correct?

Not in Zarr.

joshmoore avatar Aug 10 '22 19:08 joshmoore

I was speaking more about nczarr and Xarray, than Zarr, because Zarr has no dimension names at all. Perhaps the interesting question is: what happens when that file is read by the Xarray package? What values does Xarray assign to the dimension names?

DennisHeimbigner avatar Aug 10 '22 20:08 DennisHeimbigner

I assume that the highest resolution is used by default in Xarray.

dzenanz avatar Aug 10 '22 20:08 dzenanz

In any case, try changing the access url to include the following: ".......#mode=zarr,noxarray,ZZZ" where ZZZ is either file or zip or s3 depending on what storage format you are reading.

DennisHeimbigner avatar Aug 10 '22 20:08 DennisHeimbigner

I assume that the highest resolution is used by default in Xarray.

I would be surprised if that was so. It would be seen as violating shape constraints, I would think.

DennisHeimbigner avatar Aug 10 '22 20:08 DennisHeimbigner

dzenan@corista:~/tester$ /usr/local/bin/ncdump -h file:///media/xeonatorC/Misc/Tester/CBCT_crop1ms.zarr#mode=zarr,noxarray,file
netcdf CBCT_crop1ms {
dimensions:
	_zdim_56 = 56 ;
	_zdim_63 = 63 ;
	_zdim_67 = 67 ;
	_zdim_28 = 28 ;
	_zdim_31 = 31 ;
	_zdim_33 = 33 ;
	_zdim_14 = 14 ;
	_zdim_15 = 15 ;
	_zdim_16 = 16 ;

group: scale0 {
NetCDF: internal library error; Please contact Unidata support
Location: file ; line 1661

dzenanz avatar Aug 10 '22 20:08 dzenanz

@DennisHeimbigner you are right. xarray.open_zarr() yields:

<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

dzenanz avatar Aug 10 '22 20:08 dzenanz

I would suggest two changes to the OME format:

  1. make the dimension-name <-> shape mapping be consistent
  2. Change the "direction" attribute from this: "direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ] to something like this. "direction": {"x": [1.0,0.0,0.0], "y": [0.0, 1.0, 0.0], "z": [0.0, 0.0, 1.0]}

There may be other attributes that also should be changed.

DennisHeimbigner avatar Aug 10 '22 20:08 DennisHeimbigner

I would suggest two changes to the OME format:

Suggestion noted, thanks, @DennisHeimbigner, but I don't think we should open that conversation here.

Leaving nczarr, s3, and http out of the picture for a moment, I'm still struggling to understand whether you think these file-based pure-zarr examples from the original description should be covered by netcdf-c, @DennisHeimbigner:

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tcyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/cyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tczyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/tyx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/yx.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/zyx.ome.zarr#mode=zarr,file
free(): double free detected in tcache 2
Aborted (core dumped)

dzenan@corista:~/tester$ ncdump -h file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file
ncdump: file:///media/xeonatorC/Dev/ITKIOOMEZarrNGFF/v0.4/multi-image.ome.zarr#mode=zarr,file: NetCDF: NCZarr error

In my mind, they should especially if we are pointing the community to this as the C implementation.

joshmoore avatar Aug 11 '22 01:08 joshmoore

....these file-based pure-zarr examples from the original description should be covered by netcdf-c,

Speaking for myself, I say no. The fact that they cannot be read by Xarray makes it clear that having an inconsistent _ARRAY_DIMENSION <-> shape mapping is just wrong, and needs to be fixed.

As for the case of

"direction": [ [ 1.0, 0.0, 0.0 ], [ 0.0, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ]

that is more a matter of opinion and I would really think that using a dictionary instead of an array of arrays provides for more meaningful metadata. GDAL (https://gdal.org/drivers/raster/zarr.html), for instance, has non-standard attribute values whose value is a dictionary. I think that should be the convention for non-standard attributes.

DennisHeimbigner avatar Aug 11 '22 02:08 DennisHeimbigner

Partly fixed by https://github.com/Unidata/netcdf-c/pull/2492

DennisHeimbigner avatar Aug 28 '22 04:08 DennisHeimbigner

Speaking for myself, I say no. The fact that they cannot be read by Xarray makes it clear that having an inconsistent _ARRAY_DIMENSION <-> shape mapping is just wrong, and needs to be fixed.

@DennisHeimbigner, to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

Partly fixed by #2492

@dzenanz: will you have a chance to test if the partial fix works for you?

joshmoore avatar Sep 05 '22 14:09 joshmoore

I am finishing up a grant proposal today and tomorrow, I plan to test #2492 later in the week.

dzenanz avatar Sep 05 '22 14:09 dzenanz

,,,to clarify: is the intent for netcdf-c to be considered a Zarr library (as opposed to an nczarr library)?

The primary goal is to read/write netcdf-4 stored in Zarr format datasets. But since netcdf-4 does not exactly map to Zarr v2, it is necessary to add optional extensions to Zarr to support netcdf-4 concepts.

On the other hand, we do not want to cut ourselves off from pure Zarr datasets. So, a sub-goal is to support mapping a subset of netcdf-4 to pure Zarr. This sub-goal allows us to access pure Zarr datasets thru the netCDF API.

DennisHeimbigner avatar Sep 05 '22 19:09 DennisHeimbigner

Thanks for the explanation, @DennisHeimbigner. I definitely understand the primary goal. But is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

joshmoore avatar Sep 06 '22 06:09 joshmoore

...is there also a sub-goal of mapping all pure Zarr datasets to some subset of netcdf-4?

Short answer is yes. For example, I have just added support for Zarr fixed length string type. There are a couple of other things I will be adding when there specification is stable: .zmetadata (aggregate metadata) and sharding. If there is something you would like to see that I have missed, then let me know.

DennisHeimbigner avatar Sep 06 '22 18:09 DennisHeimbigner

I guess I defer there to @dzenanz' testing since in https://github.com/Unidata/netcdf-c/issues/2474#issuecomment-1211459122 there appear to be Zarr datasets that aren't openable. If it would help to copy them somewhere for testing, happy to do so.

joshmoore avatar Sep 06 '22 19:09 joshmoore

Trying to read C:\Dev\ITKIOOMEZarrNGFF\v0.4\cyx.ome.zarr with updated version of netCDF still produces the same error: NetCDF: NCZarr error when invoking nc_open. Same happens for the other files there (yx.ome.zarr and multi-image.ome.zarr). I am using a slightly modified netCDF: https://github.com/dzenanz/netcdf-c/tree/itkSpecific.

dzenanz avatar Sep 08 '22 16:09 dzenanz

That is because the OME use of _ARRAY_DIMENSION is still wrong and I cannot fix that particular problem.

DennisHeimbigner avatar Sep 08 '22 16:09 DennisHeimbigner

Removing either _ARRAY_DIMENSION attributes or leaving just one scale gets me further: NetCDF: Filter error: undefined filter encountered. That is understandable, since the array has:

"compressor": {
    "blocksize": 0,
    "clevel": 5,
    "cname": "lz4",
    "id": "blosc",
    "shuffle": 1
}

dzenanz avatar Sep 08 '22 20:09 dzenanz

How did you install netcdf-c? Are you building it yourself?

DennisHeimbigner avatar Sep 08 '22 23:09 DennisHeimbigner

To have the latest version, I have to build it myself. My build setup is here: https://github.com/InsightSoftwareConsortium/ITKIOOMEZarrNGFF/pull/6 Here, I am trying to add zip support to the build: https://github.com/dzenanz/ITKIOOMEZarrNGFF/commit/0bbea2563204e8bb204387852ec4d41b465ccd01 the current stumbling block is here: https://github.com/dzenanz/ITKIOOMEZarrNGFF/commit/0bbea2563204e8bb204387852ec4d41b465ccd01#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR135-R137

dzenanz avatar Sep 09 '22 12:09 dzenanz

So it looks like you are building with your own CMakeLists.txt and not using the one provided by netcdf, correct?

DennisHeimbigner avatar Sep 09 '22 17:09 DennisHeimbigner