new NetCDF_jll v400.702.402+0 broken on Windows

Open Alexander-Barth opened this issue 3 years ago • 39 comments

Unfortunately, the new NetCDF_jll v400.702.402+0 does not work on Windows (as far as I know, Linux x86_64 and Apple M1 are fine).

The new version of NetCDF_jll was created in this commit: https://github.com/JuliaPackaging/Yggdrasil/pull/4481

The errors seem to be related to the upgrade of HDF5_jll v1.12.1+0.

Related bug reports: https://github.com/Alexander-Barth/NCDatasets.jl/issues/164 https://github.com/JuliaGeo/NetCDF.jl/issues/151

The full error message that a Windows user (on Julia 1.7.2) gets when creating a NetCDF file (with the HDF5 backend), as reported by @visr, is below.

Is there a way to test the libraries natively via CI before releasing them?

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x2374cb3 -- .text at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
in expression starting at REPL[13]:1
.text at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
NC4_create at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
NC_create at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
nc__create at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
nc_create at C:\Users\visser_mn\.julia\artifacts\2b6e2ce84250e36811c3019c1ad253c1739c888f\bin\libnetcdf-18.dll (unknown line)
top-level scope at .\REPL[13]:1
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:876
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:830
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:830
jl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:894 [inlined]
jl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:944
eval at .\boot.jl:373 [inlined]
eval_user_input at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\REPL\src\REPL.jl:150
repl_backend_loop at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\REPL\src\REPL.jl:246
start_repl_backend at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\REPL\src\REPL.jl:231
#run_repl#47 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\REPL\src\REPL.jl:364
run_repl at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\REPL\src\REPL.jl:351
#930 at .\client.jl:394
jfptr_YY.930_36349.clone_1 at C:\Users\visser_mn\.julia\juliaup\julia-1.7.2+0~x64\lib\julia\sys.dll (unknown line)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1788 [inlined]
jl_f__call_latest at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:757
#invokelatest#2 at .\essentials.jl:716 [inlined]
invokelatest at .\essentials.jl:714 [inlined]
run_main_repl at .\client.jl:379
exec_options at .\client.jl:309
_start at .\client.jl:495
jfptr__start_21275.clone_1 at C:\Users\visser_mn\.julia\juliaup\julia-1.7.2+0~x64\lib\julia\sys.dll (unknown line)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1788 [inlined]
true_main at /cygdrive/c/buildbot/worker/package_win64/build/src\jlapi.c:559
jl_repl_entrypoint at /cygdrive/c/buildbot/worker/package_win64/build/src\jlapi.c:701
mainCRTStartup at /cygdrive/c/buildbot/worker/package_win64/build/cli\loader_exe.c:42
BaseThreadInitThunk at C:\WINDOWS\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)
Allocations: 9681000 (Pool: 9675485; Big: 5515); GC: 13

Alexander-Barth avatar Feb 27 '22 21:02 Alexander-Barth

Smells of upstream bug to me (and given the track record of troubles with this library I'm not even surprised)

giordano avatar Feb 27 '22 21:02 giordano

I retried building a NetCDF 4.8.1 binary (with HDF5 1.12.1) and we got the same EXCEPTION_ACCESS_VIOLATION on Windows. However, if we downgrade HDF5 to 1.12.0 (without Apple M1 support), NetCDF 4.8.1 and 4.7.4 work on Windows. For Linux, all tested combinations seem to work. Is it possible to release different versions for different platforms? Can we yank only the Windows version of NetCDF_jll v400.702.402+0?

Alexander-Barth avatar Mar 03 '22 21:03 Alexander-Barth

> Can we yank only the Windows version of NetCDF_jll v400.702.402+0?

There is no concept of platform-specificity in the registry or in the package manager, so that's unrealistic.

giordano avatar Mar 03 '22 21:03 giordano

Should this release then be yanked from all platforms? Unfortunately, this will remove the Apple M1 binary, but Windows is pretty common... (the real solution would be an updated working binary for all platforms).

Alexander-Barth avatar Mar 15 '22 09:03 Alexander-Barth

That may be necessary, yes. But it'd also be great to understand why the Windows build is broken again. We haven't touched the mingw toolchain for quite some time, the hdf5 source is always the same (just a newer version maybe?), version of netcdf source is also the same as the last working version, no?

However, please make sure no dependents of netcdf_jll require the version you want to yank, otherwise you'll make those packages completely broken

giordano avatar Mar 15 '22 09:03 giordano

> But it'd also be great to understand why the Windows build is broken again.

It might be related to the HDF5_jll update. HDF5.jl runs fine, but maybe there is an issue with the header files in HDF5_jll? The error does not occur in NetCDF when you do not use the HDF5 backend format.

> version of netcdf source is also the same as the last working version, no?

Yes, it is exactly the same version of the NetCDF sources.

If we yank NetCDF_jll 400.702.402+0, we will fall back to 400.702.400+0. I guess we will then have a problem with the latest version of TempestRemap_jll:

https://github.com/JuliaPackaging/Yggdrasil/commit/59d281664e13dd8f69750e8b268c3396677df791

I did not see any other package that would be incompatible with the fallback version:

 grep -r  NetCDF_jll .
./T/TempestModel_jll/Compat.toml:NetCDF_jll = "400.701.400-400.799"
./T/TempestModel_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./T/TempestRemap_jll/Compat.toml:NetCDF_jll = "400.701.400-400.799"
./T/TempestRemap_jll/Compat.toml:NetCDF_jll = "400.702.402-400.799"
./T/TempestRemap_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./I/IOAPI_jll/Compat.toml:NetCDF_jll = "400.701.400-400.799"
./I/IOAPI_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./N/NetcdfIO/Compat.toml:NetCDF_jll = "400.701.400-400.702.400"
./N/NetcdfIO/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./N/NetCDF/Compat.toml:NetCDF_jll = "4.7.4-4"
./N/NetCDF/Compat.toml:NetCDF_jll = "400.701.400-400"
./N/NetCDF/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./N/NCDatasets/Compat.toml:NetCDF_jll = "4.7.4-4"
./N/NCDatasets/Compat.toml:NetCDF_jll = "400.701.400-400"
./N/NCDatasets/Compat.toml:NetCDF_jll = ["400.701.400", "400.702.400"]
./N/NCDatasets/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./N/NetCDFF_jll/Compat.toml:NetCDF_jll = "400.701.400-400.799"
./N/NetCDFF_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./N/NetCDF_jll/Package.toml:name = "NetCDF_jll"
./N/NetCDF_jll/Package.toml:repo = "https://github.com/JuliaBinaryWrappers/NetCDF_jll.jl.git"
./M/MDAL_jll/Compat.toml:NetCDF_jll = "400.701.400-400.799"
./M/MDAL_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./V/VMEC_jll/Compat.toml:NetCDF_jll = "400.702.400-400.799"
./V/VMEC_jll/Deps.toml:NetCDF_jll = "7243133f-43d8-5620-bbf4-c2c921802cf3"
./Registry.toml:7243133f-43d8-5620-bbf4-c2c921802cf3 = { name = "NetCDF_jll", path = "N/NetCDF_jll" }
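
To make the yanking question concrete, here is a minimal Python sketch (not from the thread, and not Pkg's actual implementation) that checks which of the compat entries listed above would still accept the fallback version 400.702.400. It assumes a simplified reading of registry ranges: `A-B` accepts anything from `A` up to any version whose leading components do not exceed `B`, and a bare `A` accepts only versions starting with `A`. The helper names `parse` and `accepts` are hypothetical. Note that the grep output drops the Compat.toml section headers, so it does not say which package versions each entry applies to.

```python
# Simplified model of registry-style compat ranges.
# Assumption: this approximates, but is not, the logic used by Julia's Pkg.

def parse(text):
    """Turn '400.702.400' into (400, 702, 400)."""
    return tuple(int(part) for part in text.split("."))


def accepts(spec, version):
    """Return True if `version` falls inside the range `spec`.

    'A-B' spans from A up to any version whose leading components are <= B.
    A spec without '-' is treated as 'A-A'.
    """
    low_text, _, high_text = spec.partition("-")
    low = parse(low_text)
    high = parse(high_text) if high_text else low
    v = parse(version)
    padded_low = low + (0,) * (3 - len(low))  # '4' behaves like '4.0.0'
    return v >= padded_low and v[:len(high)] <= high


# A few entries copied from the grep output above. The NCDatasets list
# entry ["400.701.400", "400.702.400"] is flattened into two specs.
COMPAT = {
    "TempestRemap_jll": ["400.701.400-400.799", "400.702.402-400.799"],
    "NetcdfIO": ["400.701.400-400.702.400"],
    "NCDatasets": ["4.7.4-4", "400.701.400-400", "400.701.400", "400.702.400"],
    "VMEC_jll": ["400.702.400-400.799"],
}

FALLBACK = "400.702.400"
for package, specs in COMPAT.items():
    for spec in specs:
        verdict = "accepts" if accepts(spec, FALLBACK) else "rejects"
        print(f"{package}: {spec!r} {verdict} {FALLBACK}")
```

Under these assumptions, the TempestRemap_jll entry `400.702.402-400.799` rejects the fallback version, which matches the concern raised above.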

Alexander-Barth avatar Mar 15 '22 11:03 Alexander-Barth

> It might be related to the HDF5_jll update.

The funny thing is that Windows is the only platform for which we always used the same source: the msys2 libraries.

giordano avatar Mar 15 '22 11:03 giordano

Yes, indeed. I was wondering if there is a reason to use msys2 over conda-forge on Windows for HDF5...

Alexander-Barth avatar Mar 15 '22 11:03 Alexander-Barth

Msys2 is probably more consistent with the toolchain we use here (mingw) and we still need to provide the runtime dependencies. You'd need to investigate whether the conda-forge build for Windows is compatible with our other libraries.

giordano avatar Mar 15 '22 11:03 giordano

You are right, conda-forge uses the Windows Visual C/C++ compiler (https://conda-forge.org/docs/maintainer/knowledge_base.html)

Given that TempestRemap_jll depends on NetCDF_jll 400.702.402+0, does this mean that we are out of luck with yanking this NetCDF_jll version?

Unfortunately, the diff between the header files is quite massive (for a patch release!):

diff  HDF5.v1.12.0.x86_64-w64-mingw32/include/  HDF5.v1.12.1.x86_64-w64-mingw32/include/ | wc -l
# 44843
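
The raw line count above does not show where the header changes are concentrated. A rough Python sketch (an illustration, not something used in the thread) that counts changed lines per header file might look like this. The function names `read_lines` and `changed_lines_per_file` are hypothetical, and the counts will not match `diff | wc -l` exactly, because that also counts hunk markers and separator lines.

```python
import difflib
from pathlib import Path


def read_lines(path):
    """Return the lines of `path`, or an empty list if the file is missing."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()


def changed_lines_per_file(old_dir, new_dir, pattern="*.h"):
    """Map each header name to its number of removed plus added lines.

    Headers present in only one directory are counted in full.
    Unchanged headers are left out of the result.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    names = {p.name for p in old_dir.glob(pattern)}
    names |= {p.name for p in new_dir.glob(pattern)}
    result = {}
    for name in sorted(names):
        old = read_lines(old_dir / name)
        new = read_lines(new_dir / name)
        matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
        changed = sum(
            (i2 - i1) + (j2 - j1)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        )
        if changed:
            result[name] = changed
    return result
```

It could be called with the two include directories from the command above, for example `changed_lines_per_file("HDF5.v1.12.0.x86_64-w64-mingw32/include", "HDF5.v1.12.1.x86_64-w64-mingw32/include")`, and the result sorted by count to see which headers changed most.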

Alexander-Barth avatar Mar 15 '22 12:03 Alexander-Barth

As an alternative to yanking this version, could we also release a NetCDF 4.8.1 binary (which I got to build after applying several patches) with the "old" HDF5 1.12.0?

https://github.com/Alexander-Barth/NCDatasets.jl/issues/165#issuecomment-1057875784

Alexander-Barth avatar Mar 15 '22 13:03 Alexander-Barth

With HDF5 1.12.2 from https://packages.msys2.org/package/mingw-w64-x86_64-hdf5, I don't see this error anymore.

I guess we would be able to upgrade HDF5 to 1.12.2 once this is merged:

https://github.com/conda-forge/hdf5-feedstock/pull/175

Alexander-Barth avatar Jun 14 '22 10:06 Alexander-Barth

That's great news! Only 10 pending reviewers I see, haha.

visr avatar Jun 14 '22 11:06 visr

So I helped a bit to land https://github.com/conda-forge/hdf5-feedstock/pull/175, updating HDF5_jll to 1.12.2 in #5248. This, however, needed to be yanked from the registry again after finding out (see #5249) that conda-forge now builds against a libcurl version that is too new for us.

Does anyone have ideas how to get our hands on HDF5 1.12.2 builds for these platforms:

https://github.com/JuliaPackaging/Yggdrasil/blob/ed865e2a88bd56aed37b1f6a1f68b71c695a0926/H/HDF5/build_tarballs.jl#L20-L25

See also https://github.com/JuliaPackaging/Yggdrasil/pull/5249#issuecomment-1198008293, in case we can use conda infrastructure.

visr avatar Jul 28 '22 13:07 visr

@visr thanks a lot for your work on updating HDF5 to 1.12.2. Too bad that we now have this libcurl issue (apparently only on MacOS). I had a similar issue recently (https://github.com/JuliaPackaging/Yggdrasil/issues/5031), but I think it is unrelated.

I am not sure how to proceed. What about these possibilities:

  1. ship HDF5 1.12.1 (on Linux/MacOS), pretending it is HDF5 1.12.2, and ship the actual HDF5 1.12.2 version for Windows (fixing NetCDF for Windows users)
  2. make native builds of HDF5 on Linux x86_64 and Linux i686 with BinaryBuilder, set up an external GitHub Action to make a native build for MacOS X x86_64 (which could also be done for Windows), and rely on user contributions for MacOS aarch64 and Linux aarch64. Linux aarch64 could also be emulated via qemu.

Would option 1 or 2 have a chance to get accepted?

Alexander-Barth avatar Jul 28 '22 20:07 Alexander-Barth

For me personally (1) sounds like a good quick fix that I had not thought of. Since the difference is only a patch release it might be acceptable, though I'm curious to see what @giordano thinks. Here are the patch release notes: https://www.hdfgroup.org/2022/04/release-of-hdf5-1-12-2-newsletter-183/.

(2) sounds like a good medium term solution to get more platforms supported. Will probably be some work to organize though. Using julia's buildkite for this might make it easier to get more platforms in one setup. I wonder how that effort compares to getting HDF5 to cross compile at this point (different skills though).

visr avatar Jul 28 '22 21:07 visr

I'm not a huge fan of either solution (especially mixing and lying about version numbers), but hey, don't we lie all the time? (but at least we don't usually mix different versions...). My problem with 2 is who's going to maintain that? Certainly not me, I can barely keep up with one project, another one using tools I'm completely unfamiliar with is out of reach for me at the moment (a moment which will last fairly long). If it's someone else who ensures everything works fine and here we only need to click on merge, then it's ok.

> apparently only on MacOS

If you want to understand what the error is about: https://github.com/giordano/macos-compatibility-version

giordano avatar Jul 28 '22 22:07 giordano

Couldn't we edit the registry to change the compat of NetCDF.jl? We could add a compat entry for HDF5_jll v1.12.0.

mkitti avatar Jul 29 '22 01:07 mkitti

Also note The HDF Group release schedule: https://github.com/HDFGroup/hdf5/blob/develop/doc/img/release-schedule.png

HDF5 v1.12.3 should be the last patch release of the v1.12 minor version series. After that they will move to v1.14 as the current stable release. Only the v1.10 release will continue to receive patches.

mkitti avatar Jul 29 '22 01:07 mkitti

I submitted option 1 from @Alexander-Barth in #5251. Nobody loves this solution I think, but it should give us a build that we can work with for now, buying us time to get to better solutions.

visr avatar Jul 29 '22 10:07 visr

For the record: I tried to compile a Linux binary for HDF5 within BinaryBuilder (first part of option 2). Unfortunately, it turned out to be more complicated than I thought (as usual). In fact, the build machine is x86_64-pc-linux-musl and the target is x86_64-linux-gnu, so the HDF5 build system treats this as cross-compilation and fails with:

checking maximum decimal precision for C... configure: error: in `/workspace/srcdir/hdf5-1.12.2':
configure: error: cannot run test program while cross compiling

(even when setting export PAC_C_MAX_REAL_PRECISION=33, value from native compilation).

I have seen that conda-forge was able to cross-compile HDF5 for MacOS aarch64. I guess that this patch would allow us to bypass this configure test:

https://github.com/conda-forge/hdf5-feedstock/blob/main/recipe/patches/osx_cross_configure.patch

Unfortunately, this patch is quite complicated and is written against an automatically generated file (configure, not configure.ac).

Alexander-Barth avatar Jul 29 '22 13:07 Alexander-Barth

Yes, cross-compilation is challenging with HDF5. The main issue is that it requires one to obtain configuration from the target platform by executing test programs on that platform. Last time I looked into this, I think it could be done via the CMakeCache. It probably should be changed so that we can figure out these values by just compiling a program, since the compiler already knows many of these configuration details.

See https://forum.hdfgroup.org/t/cross-compiling-for-windows/6735/6 https://github.com/stevengj/hdf5

mkitti avatar Jul 29 '22 16:07 mkitti

I tried to compile a sample HDF5 program from https://github.com/HDFGroup/hdf5-examples within BinaryBuilder, but I got a segmentation fault. Here are my steps:

From a BinaryBuilder session (with mingw64 target):

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # wget https://raw.githubusercontent.com/HDFGroup/hdf5-examples/master/C/H5T/h5ex_t_array.c
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc -o h5ex_t_array.exe -I/workspace/destdir/include/ h5ex_t_array.c -L/workspace/destdir/bin/  -lhdf5-0
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # sha256sum /workspace/destdir/bin/zlib1.dll /workspace/destdir/bin/libsz.dll /workspace/destdir/bin/libhdf5-0.dll
a26b41bb482967b170453c93edf8f108052ab00f0c7d1134761f625c085f175e  /workspace/destdir/bin/zlib1.dll
07e014276e614e91ff1ff55e6e3b465e1d03f736aa38b408f07c3159416060e8  /workspace/destdir/bin/libsz.dll
c2f5d5c789396d7b8f68eeff683433ade42e8feb1832aefe04d921ebe8b85470  /workspace/destdir/bin/libhdf5-0.dll

I transferred the binary h5ex_t_array.exe and the 3 dlls (zlib1.dll, libsz.dll, libhdf5-0.dll) to a Windows system (msys shell):

$ ldd ./h5ex_t_array.exe
        ntdll.dll => /c/WINDOWS/SYSTEM32/ntdll.dll (0x7ff939f10000)
        KERNEL32.DLL => /c/WINDOWS/System32/KERNEL32.DLL (0x7ff938ad0000)
        KERNELBASE.dll => /c/WINDOWS/System32/KERNELBASE.dll (0x7ff9379d0000)
        msvcrt.dll => /c/WINDOWS/System32/msvcrt.dll (0x7ff939900000)
        libhdf5-0.dll => /home/Alexander Barth/libhdf5-0.dll (0x7ff90e180000)   # <- from BinaryBuilder
        ADVAPI32.dll => /c/WINDOWS/System32/ADVAPI32.dll (0x7ff939e20000)
        sechost.dll => /c/WINDOWS/System32/sechost.dll (0x7ff938c40000)
        RPCRT4.dll => /c/WINDOWS/System32/RPCRT4.dll (0x7ff9380c0000)
        zlib1.dll => /home/Alexander Barth/zlib1.dll (0x7ff9328a0000)   # <- from BinaryBuilder
        libwinpthread-1.dll => /mingw64/bin/libwinpthread-1.dll (0x7ff931ba0000)
        libsz.dll => /home/Alexander Barth/libsz.dll (0x7ff931a90000)  # <- from BinaryBuilder

$ ./h5ex_t_array.exe
Segmentation fault

$ sha256sum zlib1.dll libsz.dll libhdf5-0.dll
a26b41bb482967b170453c93edf8f108052ab00f0c7d1134761f625c085f175e *zlib1.dll
07e014276e614e91ff1ff55e6e3b465e1d03f736aa38b408f07c3159416060e8 *libsz.dll
c2f5d5c789396d7b8f68eeff683433ade42e8feb1832aefe04d921ebe8b85470 *libhdf5-0.dll

The example does work on my Linux system and on Windows when compiled natively with MSYS2. Should this test not also succeed on Windows with cross-compilation using BinaryBuilder?
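
The manual `sha256sum` comparison above can also be scripted, so that the check is repeatable on both machines. This is only a sketch using Python's standard `hashlib`. The helper names `sha256_of` and `mismatches` are hypothetical, and the digests are the ones reported above for the BinaryBuilder DLLs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(directory, expected):
    """Return the names that are missing or whose digest differs."""
    bad = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad


# Digests as printed by sha256sum in the BinaryBuilder session above.
EXPECTED = {
    "zlib1.dll": "a26b41bb482967b170453c93edf8f108052ab00f0c7d1134761f625c085f175e",
    "libsz.dll": "07e014276e614e91ff1ff55e6e3b465e1d03f736aa38b408f07c3159416060e8",
    "libhdf5-0.dll": "c2f5d5c789396d7b8f68eeff683433ade42e8feb1832aefe04d921ebe8b85470",
}
```

Calling `mismatches(".", EXPECTED)` in the directory holding the transferred DLLs would return an empty list if all three files arrived intact.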

Alexander-Barth avatar Aug 06 '22 00:08 Alexander-Barth

If I also extract libwinpthread-1.dll from BinaryBuilder, the program ./h5ex_t_array.exe no longer returns an error (but there is no screen output, unlike with native compilation). An output file h5ex_t_array.h5 is created, but it is too small and not readable:

$ h5dump h5ex_t_array.h5
h5dump error: unable to open file "h5ex_t_array.h5"

The example program seems to abort at this line: https://github.com/HDFGroup/hdf5-examples/blob/master/C/H5T/h5ex_t_array.c#L51

Alexander-Barth avatar Aug 06 '22 01:08 Alexander-Barth

I'm getting confused. The HDF5 libraries are coming from msys2:

https://github.com/JuliaPackaging/Yggdrasil/blob/f81af38618619593c4aa3591e6e5e148c1401559/H/HDF5/build_tarballs.jl#L15-L18

Shouldn't you be able to grab that package from msys2 and compile within msys2?

mkitti avatar Aug 06 '22 01:08 mkitti

It is this package exactly: https://packages.msys2.org/package/mingw-w64-x86_64-hdf5

mkitti avatar Aug 06 '22 01:08 mkitti

Yes, I am confused too. HDF5/NetCDF binaries have been a never-ending stream of moments of confusion :-)

The checksums of the HDF5 lib for native compilation in MSYS and BinaryBuilder are identical. I included all the steps because maybe there is a problem with how I tested it (I also ran the exe from a cmd shell to avoid any interaction with my local MSYS installation). Maybe somebody has the time to reproduce it.

I mentioned a similar problem here: https://github.com/Alexander-Barth/NCDatasets.jl/issues/164#issuecomment-1202094798. Native compilation of NetCDF with HDF5 in MSYS worked, but it fails with cross-compilation in BinaryBuilder.

The GCC version is different (12.1 in MSYS, 4.8.5 in BinaryBuilder)...

The error is also reproducible with this smaller example https://github.com/HDFGroup/hdf5-examples/blob/master/C/H5T/h5ex_t_int.c

This example stops at H5Dcreate when running a cross-compiled binary: https://github.com/HDFGroup/hdf5-examples/blob/master/C/H5T/h5ex_t_int.c#L55

Alexander-Barth avatar Aug 06 '22 12:08 Alexander-Barth

> Shouldn't you be able to grab that package from msys2 and compile within msys2?

Yes, I installed this library using the package manager of MSYS (pacman).

Alexander-Barth avatar Aug 06 '22 12:08 Alexander-Barth

We can specify a GCC version in BinaryBuilder.

What I would like to know is if you can compile and run the example completely within msys2. If so, then what is the difference between running it within msys2 and outside msys2?

mkitti avatar Aug 06 '22 17:08 mkitti

> We can specify a GCC version in BinaryBuilder.

~~I see gcc-6 and gcc-7 in the build image. Can we use a more recent version than gcc 7.5.0?~~ How can we do that?

> What I would like to know is if you can compile and run the example completely within msys2.

Yes, this is what I referred to as native compilation before.

> If so, then what is the difference between running it within msys2 and outside msys2?

The difference, from what I have seen, is that when you run the binary within MSYS, any missing DLL (like libwinpthread-1.dll) is taken from the standard MSYS location, whereas when you run it outside of MSYS2, an error message is produced. Within MSYS you might silently get an incompatible (or at least different) DLL.

I ran only the cross-compiled version outside of MSYS.
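
To see which copy of a DLL would be picked up in a given shell, one can list every directory on the search path that contains it, in order. The Python sketch below is a simplification: it only walks a list of directories (for example the executable's own directory followed by `PATH`) and does not model the full Windows DLL search order. The function names `candidates` and `path_dirs` are hypothetical.

```python
import os
from pathlib import Path


def candidates(dll_name, search_dirs):
    """Return every directory in `search_dirs` containing `dll_name`, in order.

    The comparison ignores case, as Windows file names do.
    The first hit is the copy that would shadow the others.
    """
    wanted = dll_name.lower()
    found = []
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        if any(entry.name.lower() == wanted for entry in directory.iterdir()):
            found.append(str(directory))
    return found


def path_dirs(env=None):
    """Split the PATH variable of `env` (default: os.environ) into directories."""
    env = os.environ if env is None else env
    return [d for d in env.get("PATH", "").split(os.pathsep) if d]
```

For instance, `candidates("libwinpthread-1.dll", ["."] + path_dirs())` run from an MSYS shell and from a plain cmd shell should make the difference visible: in the `ldd` output earlier in the thread, the MSYS shell resolved this DLL to /mingw64/bin.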

Alexander-Barth avatar Aug 06 '22 19:08 Alexander-Barth