netcdf-c
netcdf-c copied to clipboard
HDF5 stride has different semantics than netcdf stride semantics wrt unlimited.
In PR https://github.com/Unidata/netcdf-c/pull/1001 we converted the netcdf-4 code to use the HDF5 hyperslab operators to speed up strided access (stride > 1). This replaces an older element-at-a-time code implementing strided access.
It appears that the HDF5 stride semantics did not exactly match the older stride semantics in (at least) the following case:
- A variable has an unlimited dimension
- The space allocated in the HDF5 dimension for this variable is smaller than the current (maximum) UNLIMITED value.
- A vars (stride > 1) read is performed that accessed values past the allocated space for the variable, but inside the space size implied by unlimited.
- For some reason, this causes the last legal value to be misread even though it has a defined value. Instead, the fillvalue is returned.
The attached set of files shows this. If you use h5dump on the .nc file, it shows that the temperature variable has 50 defined values. However, ncdump on that file shows the temperature variable has the unlimited size of 55 values. So the netcdf program ncRepro_WeirdStrideIssue.c attempts to read values past the 50 defined values and somehow the netcdf-c library screws up. The other program is a pure HDF5 program that attempts to do the same thing, but if the count is set to 19 (as in the netcdf program) the HDF5 program fails with an index error.
The fix for this could be ugly.
I would like to fix this.
To elaborate further, the performance speedup is important, and this is all part of libhdf5.
Ok, if you have the time. My thinking was to look at the vara code to see what it does in a similar situation, and assuming vara works correctlry, there should be some code fragments that handle the unlimited case that we can transfer to the vars code.
I know what the vara code does. It finds all other datasets that share that unlimited dimension, and finds the maximum extent of the dimension. Then it fakes fill data as needed to support the unlimited dim.
I will take a look at this within the next week and hopefully have a PR with the fix.
Sounds right. Thanks.
Ellen- There may be a temporary workaround. In the file libhdf5/hdf5dispatch.c, at about line 58 change the two lines: NC4_get_vars, NC4_put_vars, to NCDEFAULT_get_vars, NCDEFAULT_put_vars,
This will revert the code to the previous vars code. However, be warned that there will be a significant performance penalty.
OK, I have managed to duplicate this error in a simple test...
Hi guys -- is there any update on this? We discovered this bug during an upgrade to netcdf and are blocked until this is resolved, as this bug results in loss of data in this particular workflow. Thanks!
I could not get a test that reproduced the error. Do you have such a test?
On Tue, Jul 9, 2019 at 10:25 AM ellenjohnson [email protected] wrote:
Hi guys -- is there any update on this? We discovered this bug during an upgrade to netcdf and are blocked until this is resolved, as this bug results in loss of data in this particular workflow. Thanks!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/1380?email_source=notifications&email_token=ABCSXXCWEHFKY7W7IT5EFZDP6S3YLA5CNFSM4HDEV2UKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZQZPRY#issuecomment-509712327, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCSXXDNTTWHKYMPS6D5CDLP6S3YLANCNFSM4HDEV2UA .
Hi Ed -- I thought you were able to repro the error in a test (your comment from April 6). If not, then see the files.zip attachment in the initial report. Dennis kindly created this issue after I emailed Dennis and Ward to report the bug and I included a standalone C program to reproduce it. The zip archive contains my ncRepro_WeirdStrideIssue.c program plus test file nctest_netcdf4_classic.nc. It also contains tst_h_stride.c which Dennis created to repro it using pure hdf5 code.
Any updates on this? Ed, did you see my reply about the repro code? Question: In Dennis' comments, he mentioned the refactoring that seems to have caused this bug (converting the netcdf-4 code to use HDF5 hyperslab for performance). Was this introduced in 4.6.2 or 4.6.3?
According to the release notes, it was added in 4.6.2-rc1.
I am looking at this again now...
OK, @DennisHeimbigner I think you attached the wrong zip file, the one attached is the one related to the endianness issue. So I could still use some help with C code that reproduces this problem...
Sorry, let me attach the correct file. stridetest.zip
I've been pinged by the Mathworks team on this; before I dive into it, I want to see if anybody has any updates that aren't reflected in the issue?
Yes, I started to look at this. Seeing as you have identified this as important for the next release, I will take a look first thing Monday morning and see if I can come up with an answer...
This issue has been raised again: https://github.com/Unidata/netcdf-c/issues/1877
Hello everyone! Is there any update on this?
Dennis, i see you mentioned https://github.com/Unidata/netcdf-c/issues/1877. That's a performance issue. This issue is with a loss of data -- the last value of the strided read is not being retrieved.
Ellen- did you ever verify that the suggestion in this comment above fixed the issue?
https://github.com/Unidata/netcdf-c/issues/1380#issuecomment-480411272
Hi Dennis and all -- we're revisiting this issue again and I realized I never responded to your question of whether I tried the workaround you suggested:
There may be a temporary workaround. In the file libhdf5/hdf5dispatch.c, at about line 58 change the two lines:
NC4_get_vars,
NC4_put_vars,
to
NCDEFAULT_get_vars,
NCDEFAULT_put_vars,
This will revert the code to the previous vars code. However, be warned that there will be a significant performance penalty.
I did try this when I was upgrading to 4.7.1, and it didn't fix the issue.
We can try this again with our recently upgraded 4.8.1, but if this temp fix works, I'm concerned about the performance penalty. While we try this, do you have any updates about this? Thanks!
I am surprised that that work-around did not work. That code was in place for a long time. It means I do not understand the problem at all. Also, that temporary fix does indeed have a significant performance penalty.
Thanks Dennis. We're testing your temp workaround now with 4.8.1 just to verify again it doesn't fix the problem. Stay tuned.
Hi Dennis,
Thanks for telling us the workaround to this issue.
I tried out your suggestion in the 4.8.1 version of the NetCDF library and it works. The data loss doesn't occur.
You did mention there is a significant performance penalty when the patch is applied. We'd like to know if you have plans for addressing the performance issue in a coming release.