PR adding support for Zarr V3
Update (1/6/2025): Ok, this branch is usable except for MSYS2 and Visual Studio, which are failing with some kind of "out of space" error whose meaning I cannot figure out.
=============================================
Update (12/29/24): I am going ahead and merging the code from PR https://github.com/Unidata/netcdf-c/pull/3066, since that code is fairly easy to follow and I have the best knowledge of what kind of tailoring is needed to adapt it to the numerous changes.
=============================================
I am posting this PR as a draft so that Manuel Reis and others can work with it. I will revise this PR description later.
Notes:
- It does not yet run on github actions without errors.
- It appears to have no memory leaks
- It has problems with the testing infrastructure that I am working on.
- CMake does not work for testing v3_nczarr_test because of CMake Policy CMP0002.
- Ubuntu + Automake appears to work including testing v3_nczarr_test.
- make distcheck originally did not work because of difficulties with VPATH; it works now.
Thanks a lot for including the feature from https://github.com/Unidata/netcdf-c/pull/3066 @DennisHeimbigner and apologies for the delay in the reply!
I just built this branch, rebased it on Unidata/main, and ran ncdump -h on a huge dataset:
- I see that although the metadata handler is correctly initialized (the dataset is consolidated), the format inference happens before that and walks the storage, which takes a lot of time for big datasets on our system. Since the dataset is consolidated (as verified after the walk), it would be better to simply check for the existence of any of the Zarr metadata files across all versions.
There are many ways to improve this, but here is one quick fix I made on my side that reduces the time it takes to get the header information of the big dataset. (Edited, because the first version was breaking tests.)
```diff
diff --git a/libnczarr/zinfer.c b/libnczarr/zinfer.c
index e11afd6f4..6181b7e32 100644
--- a/libnczarr/zinfer.c
+++ b/libnczarr/zinfer.c
@@ -51,11 +51,11 @@ struct ZarrObjects {
 int zarr_version;
 int haszmetadata;
 } zarrobjects[] = {
-{"zarr.json", ZARRFORMAT3, 0},
-{".zgroup", ZARRFORMAT2, 0},
-{".zarray", ZARRFORMAT2, 0},
-{".zattrs", ZARRFORMAT2, 0},
-{".zmetadata", ZARRFORMAT2, 1},
+{"/zarr.json", ZARRFORMAT3, 0},
+{"/.zgroup", ZARRFORMAT2, 0},
+{"/.zarray", ZARRFORMAT2, 0},
+{"/.zattrs", ZARRFORMAT2, 0},
+{"/.zmetadata", ZARRFORMAT2, 1},
 {NULL, 0, 0},
 };
@@ -187,7 +187,6 @@ done:
 return THROW(stat);
 }
-int
 NCZ_infer_open_zarr_format(NC_FILE_INFO_T* file)
 {
 int stat = NC_NOERR;
@@ -198,8 +197,19 @@ NCZ_infer_open_zarr_format(NC_FILE_INFO_T* file)
 NC_UNUSED(file);
 /* Probe the map for tell-tale objects and dict keys */
-
 if(zarrformat == 0) {
+    size_t seglen = 0;
+    struct ZarrObjects *zo = NULL;
+    stat = NC_ENOTZARR; /* Until proven otherwise we aren't sure it's a zarr dataset */
+    /* We search on the root path for V2 or V3 tags */
+    for (zo = zarrobjects; zo->name; zo++) {
+        if ((stat = nczmap_exists(zfile->map,zo->name)) == NC_NOERR) {
+            zarrformat = zo->zarr_version;
+            break; /* No need to look for more keys */
+        }
+    }
+}
+if(zarrformat == 0 || stat != NC_NOERR) {
 /* We need to search subtree for a V2 or V3 tag */
 switch(stat = nczmap_walk(zfile->map,"/",tagsearch, &param)) {
 case NC_ENOOBJECT:
@@ -308,7 +318,7 @@ tagsearch(NCZMAP* map, const char* prefix, const char* key, void* param)
 if(seglen == 0) return NC_NOERR;
 for(zo=zarrobjects;zo->name;zo++) {
-    if(strcasecmp(segment,zo->name)==0) {
+    if(strcasecmp(segment,zo->name+1)==0) {
 formats->zarrformat = zo->zarr_version;
 return NC_NOERR; /* tell walker to stop */
 }
```
The first suggestion was breaking some tests, so I added a commit to tackle this in a better way: https://github.com/mannreis/netcdf-c/commit/806c1a71b8cac543cd941a2c5eeb0b10e81475e5 (s3consolidate-patch.txt). This checks for the presence of the individual metadata files and falls back to listing the contents of the storage in case they are not present at the top level.
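The probe-first, walk-as-fallback idea above can be sketched roughly as follows. This is only an illustration, not the libnczarr code: it uses `stat()` on a local directory as a stand-in for `nczmap_exists()`, and the function name `infer_zarr_format` is hypothetical.

```c
/* Hypothetical sketch of "probe the root for metadata files first, and
 * only walk the whole tree if nothing is found". stat() stands in for
 * nczmap_exists(); the real code probes the NCZMAP abstraction instead. */
#include <stdio.h>
#include <sys/stat.h>

#define ZARRFORMAT2 2
#define ZARRFORMAT3 3

static const struct { const char* name; int zarr_version; } zarrobjects[] = {
    {"zarr.json",  ZARRFORMAT3},
    {".zgroup",    ZARRFORMAT2},
    {".zarray",    ZARRFORMAT2},
    {".zattrs",    ZARRFORMAT2},
    {".zmetadata", ZARRFORMAT2},
    {NULL, 0},
};

/* Return the inferred Zarr version, or 0 if no root-level tag was found;
 * a real implementation would then fall back to the recursive walk. */
int infer_zarr_format(const char* rootpath) {
    char path[4096];
    struct stat st;
    for (int i = 0; zarrobjects[i].name != NULL; i++) {
        snprintf(path, sizeof(path), "%s/%s", rootpath, zarrobjects[i].name);
        if (stat(path, &st) == 0)
            return zarrobjects[i].zarr_version; /* first hit wins */
    }
    return 0; /* caller falls back to the expensive subtree walk */
}
```

The point is that for a consolidated dataset the loop touches at most a handful of root-level keys, instead of listing every object in the store.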
Added a fix for issue (https://github.com/Unidata/netcdf-c/issues/3075)
For what it's worth, I'm able to recreate the Not Enough Space failure on my local Windows VM, so I should be able to dive in and see what the issue is more directly.
```
*** Test zmap create -k file
canonicalfile: C:\Users\wfisher\Desktop\tmp\netcdf-c-DennisHeimbigner-v3plug.dmh\build\alltests_1737496015\nczarr_test\testdir_mapapi\tmp_mapapi.file
url=|file:///cygdrive/C/Users/wfisher/Desktop/tmp/netcdf-c-DennisHeimbigner-v3plug.dmh/build/alltests_1737496015/nczarr_test/testdir_mapapi/tmp_mapapi.file#mode=nczarr,file|
```
@DennisHeimbigner So there is one issue going on here. It is trying to create the file on the path /cygdrive/c, but I am not in Cygwin (I am in an MSYS2 bash session) and no such path exists. This would explain the 'out of space', if it were trying and unable to access this non-existent path. I'm not sure what heuristic is being used to determine the pathing, I'll dig into it.
Now that you mention I was also facing out of space issues on my machine. Which went away after a bit of clean up and a reboot. Most data was on ~/.cache
Can you run the run_ut_mapapi.sh script with the -x flag set so I can see where the url is being constructed?
@DennisHeimbigner Here you go; I'll let you take a look before I dig into it, but I was going to try to sort this out after a lunch-hour meeting. I initially was poking around in d4pathmgr.c, but since this is only happening in nczarr_test/, that's probably a red herring. This is happening in an MSYS Bash shell, both on GitHub-hosted runners and on my development VM.
@DennisHeimbigner does anything leap out at you from the log?
Sorry, I got side-tracked. Let me look at it right now.
Ok, I think I see the problem. Let me test it.
I was forgetting to localize the file path.
On MSYS2, this now produces this file path:
```
D:/a/netcdf-c/netcdf-c/alltests_1737666259/nczarr_test/testdir_mapapi/tmp_mapapi.file
```
Does this seem correct?
But there appears to be a problem with the corresponding URL, namely:
```
file://D:/a/netcdf-c/netcdf-c/alltests_1737666259/nczarr_test/testdir_mapapi/tmp_mapapi.file#mode=nczarr,file
```
Does this URL work on MSYS2?
Or should it be something like:
```
file:///d/a/netcdf-c/netcdf-c/alltests_1737666259/nczarr_test/testdir_mapapi/tmp_mapapi.file#mode=nczarr,file
```
@DennisHeimbigner I believe it should be the second, file:///d (...) but I'm double checking. For what it's worth, it should be whatever the other non-nczarr tests are doing, as those shell scripts are working.
@DennisHeimbigner here is a log from before the PR, with the file pathing that works. I am surprised to still see the file:///cygdrive nomenclature, but the pathing does work. It appears that file:/// is the appropriate prefix.
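For illustration, the difference between the two URL forms above comes down to rewriting a Windows drive path into a POSIX-style path before prefixing `file:///`. A minimal sketch (the function name `windows_path_to_msys_url` is hypothetical, not netcdf-c code):

```c
/* Hedged sketch: convert a Windows drive path like "D:/a/b" into an
 * MSYS-style file URL "file:///d/a/b". Not the actual netcdf-c path
 * manager; just the shape of the transformation under discussion. */
#include <ctype.h>
#include <stdio.h>

/* Writes the URL into buf and returns buf, or returns NULL if path is
 * not of the form "<letter>:/...". */
char* windows_path_to_msys_url(const char* path, char* buf, size_t buflen) {
    if (path == NULL || !isalpha((unsigned char)path[0])
        || path[1] != ':' || path[2] != '/')
        return NULL; /* not an absolute Windows drive path */
    /* Lowercase the drive letter and drop the ":", per MSYS convention */
    snprintf(buf, buflen, "file:///%c%s",
             tolower((unsigned char)path[0]), path + 2);
    return buf;
}
```

Given `D:/a/netcdf-c/...`, this yields `file:///d/a/netcdf-c/...`, i.e. the second form with the three-slash prefix.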
@DennisHeimbigner The fact that the pathing used to work makes me wonder now if it was a double red-herring; I'm looking at the following from the first log I provided:
```
Pass: create: create: file:///cygdrive/C/Users/wardf/Desktop/tmp/netcdf-c-DennisHeimbigner-v3plug.dmh/build/alltests_1737569215/nczarr_test/testdir_mapapi/tmp_mapapi.file#mode=nczarr,file
Pass: create: defineobj: /.zgroup
Pass: create: close
Pass: create: open: file:///cygdrive/C/Users/wardf/Desktop/tmp/netcdf-c-DennisHeimbigner-v3plug.dmh/build/alltests_1737569215/nczarr_test/testdir_mapapi/tmp_mapapi.file#mode=nczarr,file
Pass: create: exists: /.zgroup
Pass: create: close
+ cdl=ut_mapapi_create_file.cdl
+ ref=ref_ut_mapapi_create.cdl
+ C:/Users/wardf/Desktop/tmp/netcdf-c-DennisHeimbigner-v3plug.dmh/build/nczarr_test/../nczarr_test/zmapio file://tmp_mapapi.file#mode=nczarr,file
Default type: ubyte
Default action: objdump
fail: NetCDF: Some object not found
```
I see a mix of file path styles here, both `file:///` and `C:/Users`. Is this potentially a problem, resulting in `NetCDF: Some object not found`?
I remember now. Using the Cygwin form as an intermediary is ok, because when the file is opened by libnetcdf, it first localizes the path, so you end up with the correct path being passed to open/fopen.
It is not clear where the "object not found" is coming from.
If you think the file is missing, you can test by putting the command
```
find ${ISOPATH}
```
in run_ut_mapapi, in the testmapcreate() function.
There is a `${ZMD}` command (an alias for calling zmapio) in that function.
If you put the above find command just before that ZMD call, it should list everything in that directory. Then we can see if there is a missing file.
In any case, I don't think any of this affects the "not enough space" problem.
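The localization step mentioned above can be sketched like this. It is an illustration only, not the actual d4pathmgr/libdispatch code, and the function name `localize_cygwin_path` is hypothetical:

```c
/* Hedged sketch of path "localization": map a Cygwin-style path such as
 * "/cygdrive/c/Users/x" back to the native Windows form "c:/Users/x"
 * before it reaches open()/fopen(). Other paths pass through unchanged. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

char* localize_cygwin_path(const char* path, char* buf, size_t buflen) {
    const char* prefix = "/cygdrive/";
    size_t plen = strlen(prefix);
    if (strncmp(path, prefix, plen) == 0 && isalpha((unsigned char)path[plen])) {
        /* "/cygdrive/c/rest" -> "c:/rest" */
        snprintf(buf, buflen, "%c:%s", path[plen], path + plen + 1);
    } else {
        /* Already a native (or non-Cygwin) path; pass it through */
        snprintf(buf, buflen, "%s", path);
    }
    return buf;
}
```

This is consistent with the earlier observation that the `file:///cygdrive/...` URLs work even outside Cygwin: the Cygwin spelling is only an intermediary, rewritten before any filesystem call.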
@DennisHeimbigner It makes sense to me that the 'not enough space' results from trying to access an invalid file path, or a path it does not have access to. Is there a way to check if the path exists during localization, and throw a message if it does not?
@DennisHeimbigner I followed your suggestion and added the `find ${ISOPATH}` command. The directory is completely empty, which would explain why the objects are failing to be found.
A further update: my previous statement was not correct. I've broken the script down a bit, and it appears that the error happens when zmapio is given the directory; despite the contents being there, it reports a missing object. I've attached the directory below; @DennisHeimbigner, can you tell if it is correct/intact? I'm trying to determine whether the issue is zmapio reading or ut_mapapi writing.
The directory looks ok to me. Let me poke around.
In order to get some handle on the role of the path converter, I am building a branch off of master that disables all path conversions.
Despite the hiccups with the tests, I just want to ask @DennisHeimbigner to drop the ZOH changes from this PR (they do not even compile).
I assume that means
- remove the zoh: protocol
- remove the zoh mode flag
Correct?
Yes. I think a separate PR for that will help with closing this one.
I see the out of space problem is still there.
I had hoped to include this, and it feels like we're almost there, but I'm going to get v4.9.3 out and then come back to this for the subsequent v4.10.0 release.
I agree. This PR has been nothing but a disaster and I take full responsibility for it.
@DennisHeimbigner Refreshing my memory on this, the original issue appears to have been fixed by some of the changes I made. Now the failures are somehow related to string-comparison output where character counts differ. See 6875ae3cf974fe34eb3e42c9da452d1a371462a6 and the surrounding commits I made to see what changed.