netcdf-c icon indicating copy to clipboard operation
netcdf-c copied to clipboard

HDF5 stride has different semantics than netcdf stride semantics wrt unlimited.

Open DennisHeimbigner opened this issue 5 years ago • 24 comments

In PR https://github.com/Unidata/netcdf-c/pull/1001 we converted the netcdf-4 code to use the HDF5 hyperslab operators to speed up strided access (stride > 1). This replaces an older element-at-a-time code implementing strided access.

It appears that the HDF5 stride semantics did not exactly match the older stride semantics in (at least) the following case:

  1. A variable has an unlimited dimension
  2. The space allocated in the HDF5 dimension for this variable is smaller than the current (maximum) UNLIMITED value.
  3. A vars (stride > 1) read is performed that accessed values past the allocated space for the variable, but inside the space size implied by unlimited.
  4. For some reason, this causes the last legal value to be misread even though it has a defined value. Instead, the fillvalue is returned.

The attached set of files shows this. If you use h5dump on the .nc file, it shows that the temperature variable has 50 defined values. However, ncdump on that file shows the temperature variable has the unlimited size of 55 values. So the netcdf program ncRepro_WeirdStrideIssue.c attempts to read values past the 50 defined values and somehow the netcdf-c library screws up. The other program is a pure HDF5 program that attempts to do the same thing, but if the count is set to 19 (as in the netcdf program) the HDF5 program fails with an index error.

The fix for this could be ugly.

files.zip

DennisHeimbigner avatar Apr 02 '19 21:04 DennisHeimbigner

I would like to fix this.

edhartnett avatar Apr 02 '19 21:04 edhartnett

To elaborate further, the performance speedup is important, and this is all part of libhdf5.

edhartnett avatar Apr 02 '19 21:04 edhartnett

Ok, if you have the time. My thinking was to look at the vara code to see what it does in a similar situation, and assuming vara works correctlry, there should be some code fragments that handle the unlimited case that we can transfer to the vars code.

DennisHeimbigner avatar Apr 02 '19 22:04 DennisHeimbigner

I know what the vara code does. It finds all other datasets that share that unlimited dimension, and finds the maximum extent of the dimension. Then it fakes fill data as needed to support the unlimited dim.

I will take a look at this within the next week and hopefully have a PR with the fix.

edhartnett avatar Apr 02 '19 23:04 edhartnett

Sounds right. Thanks.

DennisHeimbigner avatar Apr 03 '19 02:04 DennisHeimbigner

Ellen- There may be a temporary workaround. In the file libhdf5/hdf5dispatch.c, at about line 58 change the two lines: NC4_get_vars, NC4_put_vars, to NCDEFAULT_get_vars, NCDEFAULT_put_vars,

This will revert the code to the previous vars code. However, be warned that there will be a significant performance penalty.

DennisHeimbigner avatar Apr 05 '19 20:04 DennisHeimbigner

OK, I have managed to duplicate this error in a simple test...

edhartnett avatar Apr 06 '19 14:04 edhartnett

Hi guys -- is there any update on this? We discovered this bug during an upgrade to netcdf and are blocked until this is resolved, as this bug results in loss of data in this particular workflow. Thanks!

ellenjohnson avatar Jul 09 '19 16:07 ellenjohnson

I could not get a test that reproduced the error. Do you have such a test?

On Tue, Jul 9, 2019 at 10:25 AM ellenjohnson [email protected] wrote:

Hi guys -- is there any update on this? We discovered this bug during an upgrade to netcdf and are blocked until this is resolved, as this bug results in loss of data in this particular workflow. Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/1380?email_source=notifications&email_token=ABCSXXCWEHFKY7W7IT5EFZDP6S3YLA5CNFSM4HDEV2UKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZQZPRY#issuecomment-509712327, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCSXXDNTTWHKYMPS6D5CDLP6S3YLANCNFSM4HDEV2UA .

edhartnett avatar Jul 09 '19 16:07 edhartnett

Hi Ed -- I thought you were able to repro the error in a test (your comment from April 6). If not, then see the files.zip attachment in the initial report. Dennis kindly created this issue after I emailed Dennis and Ward to report the bug and I included a standalone C program to reproduce it. The zip archive contains my ncRepro_WeirdStrideIssue.c program plus test file nctest_netcdf4_classic.nc. It also contains tst_h_stride.c which Dennis created to repro it using pure hdf5 code.

ellenjohnson avatar Jul 09 '19 16:07 ellenjohnson

Any updates on this? Ed, did you see my reply about the repro code? Question: In Dennis' comments, he mentioned the refactoring that seems to have caused this bug (converting the netcdf-4 code to use HDF5 hyperslab for performance). Was this introduced in 4.6.2 or 4.6.3?

ellenjohnson avatar Jul 17 '19 18:07 ellenjohnson

According to the release notes, it was added in 4.6.2-rc1.

DennisHeimbigner avatar Jul 17 '19 18:07 DennisHeimbigner

I am looking at this again now...

edhartnett avatar Nov 15 '19 19:11 edhartnett

OK, @DennisHeimbigner I think you attached the wrong zip file, the one attached is the one related to the endianness issue. So I could still use some help with C code that reproduces this problem...

edhartnett avatar Nov 15 '19 19:11 edhartnett

Sorry, let me attach the correct file. stridetest.zip

DennisHeimbigner avatar Nov 15 '19 20:11 DennisHeimbigner

I've been pinged by the Mathworks team on this; before I dive into it, I want to see if anybody has any updates that aren't reflected in the issue?

WardF avatar Feb 21 '20 22:02 WardF

Yes, I started to look at this. Seeing as you have identified this as important for the next release, I will take a look first thing Monday morning and see if I can come up with an answer...

edwardhartnett avatar Feb 21 '20 23:02 edwardhartnett

This issue has been raised again: https://github.com/Unidata/netcdf-c/issues/1877

DennisHeimbigner avatar Oct 30 '20 16:10 DennisHeimbigner

Hello everyone! Is there any update on this?
Dennis, i see you mentioned https://github.com/Unidata/netcdf-c/issues/1877. That's a performance issue. This issue is with a loss of data -- the last value of the strided read is not being retrieved.

ellenjohnson avatar Mar 09 '21 01:03 ellenjohnson

Ellen- did you ever verify that the suggestion in this comment above fixed the issue?

https://github.com/Unidata/netcdf-c/issues/1380#issuecomment-480411272

DennisHeimbigner avatar Mar 09 '21 04:03 DennisHeimbigner

Hi Dennis and all -- we're revisiting this issue again and I realized I never responded to your question of whether I tried the workaround you suggested:

There may be a temporary workaround.   In the file libhdf5/hdf5dispatch.c, at about line 58  change the two lines:
NC4_get_vars,
NC4_put_vars,
to
NCDEFAULT_get_vars,
NCDEFAULT_put_vars,
This will revert the code to the previous vars code. However, be warned that there will be a significant performance penalty.

I did try this when I was upgrading to 4.7.1, and it didn't fix the issue.

We can try this again with our recently upgraded 4.8.1, but if this temp fix works, I'm concerned about the performance penalty. While we try this, do you have any updates about this? Thanks!

ellenjohnson avatar Sep 30 '21 18:09 ellenjohnson

I am surprised that that work-around did not work. That code was in place for a long time. It means I do not understand the problem at all. Also, that temporary fix does indeed have a significant performance penalty.

DennisHeimbigner avatar Sep 30 '21 19:09 DennisHeimbigner

Thanks Dennis. We're testing your temp workaround now with 4.8.1 just to verify again it doesn't fix the problem. Stay tuned.

ellenjohnson avatar Sep 30 '21 20:09 ellenjohnson

Hi Dennis,

Thanks for telling us the workaround to this issue.

I tried out your suggestion in the 4.8.1 version of the NetCDF library and it works. The data loss doesn't occur.

You did mention there is a significant performance penalty when the patch is applied. We'd like to know if you have plans for addressing the performance issue in a coming release.

agnishd avatar Mar 02 '22 12:03 agnishd