Thread safety and the HDF5 error stack in 4.9.3
Further details are described in a conda-forge issue I opened a while ago, but most of that thread is me debugging and narrowing things down, which I've summarized below.
Use Case and Disclaimer
I use NetCDF C from netcdf4-python in a multi-threaded setting, usually via the Python dask library. When reading files in parallel it is always through the Python xarray library, which to my understanding uses locks for different files. However, I originally ran into the issue below when creating new files (one per thread). I understand, although I had forgotten, that these use cases are likely not supported or intended to work, but any details on the state of thread safety in NetCDF would be interesting to hear.
Related: https://github.com/unidata/netcdf-c/issues/382 https://github.com/unidata/netcdf-c/issues/1373
The Problem
I discovered that when a second thread creates a new NetCDF file after the first thread has already initialized the NetCDF C library (and therefore the HDF5 library's error handling), HDF5's error stack messages are printed/leaked. These messages correspond to expected failure/error cases inside the HDF5 library.
Reproducer
Shame: ChatGPT gave me the skeleton of the code below. I'm primarily a Python developer.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <netcdf.h>
#include <unistd.h>
#include <sys/types.h>

#define FILE_PATH "created_file.nc" // Will be created (overwritten if exists)

// Macro to handle NetCDF errors
#define NC_CHECK(call) \
    do { \
        int retval = call; \
        if (retval != NC_NOERR) { \
            fprintf(stderr, "NetCDF error: %s\n", nc_strerror(retval)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Thread function to create a NetCDF-4 file
void* create_netcdf4_file(void* arg) {
    const char* path = (const char*)arg;
    int ncid;
    pid_t tid = gettid();

    printf("Thread ID: %d\n", tid);
    printf("Thread: Creating NetCDF-4 file: %s\n", path);
    NC_CHECK(nc_set_default_format(NC_FORMAT_NETCDF4, NULL));

    // Create a new NetCDF-4 file, overwriting if it exists
    //NC_CHECK(nc_create(path, NC_NETCDF4 | NC_CLOBBER, &ncid));
    NC_CHECK(nc_create(path, NC_CLOBBER, &ncid));
    printf("Thread: File created (ID: %d)\n", ncid);

    // Close the file
    NC_CHECK(nc_close(ncid));
    printf("Thread: NetCDF-4 file created and closed successfully.\n");
    return NULL;
}

int main() {
    pthread_t thread;
    pid_t tid = gettid();

    printf("Main: Starting thread to create NetCDF-4 file...\n");
    printf("Thread ID: %d\n", tid);

    // Force NetCDF (and therefore HDF5) initialization in the main thread
    nc_rc_set("HTTP.SSL.CAINFO", "");

    if (pthread_create(&thread, NULL, create_netcdf4_file, (void*)FILE_PATH) != 0) {
        perror("pthread_create");
        return EXIT_FAILURE;
    }

    // Wait for the thread to finish
    pthread_join(thread, NULL);
    printf("Main: Thread finished.\n");
    return EXIT_SUCCESS;
}
Put the above in create_netcdf4_threaded.c and build with:
gcc -o create_netcdf4_threaded create_netcdf4_threaded.c -lnetcdf -lpthread
Then, before each run, make sure to delete any existing output file; otherwise the error doesn't show.
rm created_file.nc
./create_netcdf4_threaded
In the output there should be something like:
HDF5-DIAG: Error detected in HDF5 (1.14.6) thread 1:
  #000: H5F.c line 496 in H5Fis_accessible(): unable to determine if file is accessible as HDF5
    major: File accessibility
    minor: Not an HDF5 file
  #001: H5VLcallback.c line 3913 in H5VL_file_specific(): file specific failed
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 3848 in H5VL__file_specific(): file specific failed
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_file.c line 344 in H5VL__native_file_specific(): error in HDF5 file check
    major: File accessibility
    minor: Can't get value
  #004: H5Fint.c line 1055 in H5F__is_hdf5(): unable to open file
    major: File accessibility
    minor: Unable to initialize object
  #005: H5FD.c line 788 in H5FD_open(): can't open file
    major: Virtual File Layer
    minor: Unable to open file
  #006: H5FDsec2.c line 324 in H5FD__sec2_open(): unable to open file: name = 'test.nc', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
    major: File accessibility
    minor: Unable to open file
This happens in NetCDF C 4.9.3 but not 4.9.2.
Environment
This has been tested on various flavors of Linux, using the conda-forge libnetcdf package 4.9.3. I've also built it from source and git bisected the change that introduced it; version 4.9.2 does not show the problem. Builds so far have used gcc 15.1 from conda-forge. HDF5 1.14.6 comes from conda-forge, but the issue has also been seen with HDF5 built from source for debugging.
Bisect
This started happening in https://github.com/unidata/netcdf-c/commit/f37fe57cebe207977380dde6973fcf422ac619cc which is part of https://github.com/Unidata/netcdf-c/pull/2021.
It has been a while since I started looking at this and tracked the actual functions down, but if I remember correctly, the NC4/HDF5 initialization functions set HDF5's "auto" error-stack printing functions to NULL so that nothing is printed:
https://github.com/Unidata/netcdf-c/blob/8c5f353f664c33ef5e2b6449bbbf96d32670bac4/libhdf5/hdf5internal.c#L68-L90
But this only gets called once and is not re-run in a new thread. In my example program I specifically call nc_rc_set to force this initialization in the main thread; this mimics some calls in the netcdf4-python library.
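For reference, here is a minimal sketch of what that one-time suppression amounts to (my paraphrase of the linked hdf5internal.c, not the exact source). To my understanding, in a thread-safe HDF5 build H5E_DEFAULT resolves to the calling thread's own error stack, so this call only affects the thread that runs it:

#include <hdf5.h>

/* Sketch of netcdf-c's one-time initialization step (paraphrased;
 * the function name here is mine). In a thread-safe HDF5 build,
 * other threads keep the default auto-printing behavior because
 * H5E_DEFAULT is thread-local. */
static int nc4_hdf5_initialize_sketch(void) {
    /* Turn off automatic printing of the HDF5 error stack. */
    if (H5Eset_auto2(H5E_DEFAULT, NULL, NULL) < 0)
        return -1;
    return 0;
}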
The changes in #2021 seem to try to open the specified file for reading even though we're explicitly creating a new file (and clobbering it if it exists). They use HDF5 to tell whether the file exists, relying on HDF5 raising an error:
https://github.com/Unidata/netcdf-c/pull/2021/files#diff-1e6e01c6a7a1ed4c38d2bb4760c1f0740ee7dd825c0ba8cac86162bafa50949fR1597-R1600
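The trace above shows the probe going through H5Fis_accessible(). Here is a standalone sketch of the behavior (my own illustration, not netcdf-c's code): probing a missing path makes HDF5 raise, and by default print, an error, even though the caller treats the failure as an ordinary "file absent" answer.

#include <hdf5.h>
#include <stdio.h>

int main(void) {
    /* Probe a path that does not exist. HDF5 raises an internal error
     * and, unless auto-printing was disabled in this thread, dumps the
     * HDF5-DIAG stack shown earlier. */
    htri_t ret = H5Fis_accessible("does_not_exist.nc", H5P_DEFAULT);

    /* ret < 0 signals an error, but a caller can still interpret it as
     * "no usable HDF5 file here" and go on to create the file. */
    printf("H5Fis_accessible returned %d\n", (int)ret);
    return 0;
}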
Questions
So I guess my main question is: how much of this is expected? Given that it didn't happen in 4.9.2 but does in 4.9.3, I'm hoping it is unintended rather than that I was just getting lucky before.
Edit: I should have added that everything works fine. This is just a printed error; the failure is expected by the NetCDF code and the file is created just fine in the end.
This is interesting, thanks! The short answer is: errors are expected; netCDF is not thread-safe. This is because, for a very long time, libhdf5 was not thread-safe, and after that, the thread-safety feature was experimental and limited in ways that I do not recall at the moment. I'm, frankly, not sure what the current state of thread safety in libhdf5 is. I'm glad everything is working fine, but there is no guarantee that it will; have you validated the data in the files you're writing?
Yes. The files produced by my real-world multi-threaded case, where individual files are written one per thread, have not shown any issues. Obviously this could just be a matter of time.
As for this particular 4.9.3 change producing the error stack: I've walked the code (lots of printf) and it really is this new DAOS object check saying "I know we're going to clobber this file and create it, but let's try opening it for reading anyway." That code fully expects and handles the failure; it's just that HDF5 doesn't know that, so it still prints the error stack.
I guess from a short-sighted, inexperienced, naive point of view, we could imagine that NetCDF needs both a per-thread initialization and a global one. I don't remember from my debugging and research whether HDF5 is meant to preserve error-handling callback configuration across threads.
Reference:
https://support.hdfgroup.org/documentation/hdf5/latest/group___h5_e.html#gaf0d6b18cd5160517fe5165b9a8443c69
https://support.hdfgroup.org/documentation/hdf5/latest/_h5_e__u_g.html
However, neither explicitly says that it needs to be called per-thread, though Google's AI overview claims it is necessary. There's also this h5py-related forum post:
https://forum.hdfgroup.org/t/h5eset-auto-for-all-threads/6305
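If per-thread configuration really is required, one possible workaround (my own sketch, assuming a thread-safe HDF5 build with thread-local error stacks; this is user code, not a netcdf-c API) would be for each worker thread to silence auto-printing itself before calling into NetCDF:

#include <hdf5.h>

/* Hypothetical per-thread setup for the reproducer's worker thread. */
void* worker(void* arg) {
    /* Disable automatic error-stack printing for this thread's
     * (thread-local) default error stack. */
    H5Eset_auto2(H5E_DEFAULT, NULL, NULL);

    /* ... nc_create()/nc_close() as in the reproducer above ... */
    return NULL;
}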
I don't know anything about these "DAOS objects", but I suppose one way to avoid this specific "extra" error printing would be to skip the DAOS object check when we're clobbering/creating a file. However, that check happens in the "infer model" portion of the NetCDF code, and I don't know what the rules are there or where the lines are drawn between file creation and model inference.