netcdf-c
nc_get_vars incredibly slow in Windows compared to Linux
OS: Windows 10
NetCDF version: 4.9.1
I am trying to read a 3D double variable (2000 x 512 x 512) from a netCDF4 file with the following parameters:

start[] = {0, 0, 0}
count[] = {1000, 256, 256}
stride[] = {2, 2, 2}
chunk size: {20, 10, 10}
shuffle: no
deflate: yes
deflate_level: 6
I time the call to nc_get_vars. On Debian 11, it takes ~25 seconds. On Windows 10, it takes ~130 seconds.
I would expect Windows to be slightly slower, but a >5x slowdown is unexpected. I see a similar slowdown with 'nc_get_vars_double'.
By contrast, reading the whole variable with 'nc_get_var_double' or 'nc_get_var' is significantly faster (~3 sec on Linux, and ~1 sec on Windows).
- Is there a way to optimize the performance of 'nc_get_vars' or 'nc_get_vars_double' so that Windows performance is closer to Linux performance?
- Is reading the whole variable into memory using 'nc_get_var' and then slicing it later an option? I have seen some discussion of this (https://github.com/Unidata/netcdf-c/issues/908), and that a patch was submitted to make strided reads faster. But for my variable, reading the whole variable still seems to be significantly faster than a strided read (especially on Windows).
Please find the link to the nc file here. Here is my code:
```cpp
#include <stdio.h>
#include <string.h>
#include <netcdf.h>
#include <cstdlib>
#include <iostream>
#include <chrono>

int
main()
{
    int status;
    int ncid;
    int varid;
    int elems_x = 256;
    int elems_y = 256;
    int elems_z = 1000;
    double* outData = (double*)malloc(elems_x * elems_y * elems_z * sizeof(double));

    size_t start[] = {0, 0, 0};
    size_t count[] = {1000, 256, 256};
    ptrdiff_t stride[] = {2, 2, 2};

    // open the NetCDF-4 file
    status = nc_open("repro_nc4file.nc", NC_NOWRITE, &ncid);
    if (status != NC_NOERR) {
        printf("Could not open file.\n");
        return 1;
    }

    // get the varid
    status = nc_inq_varid(ncid, "my_var", &varid);
    printf("status after inq var = %d\n", status);
    printf("varid = %d\n", varid);

    // time the strided read
    auto timestart = std::chrono::high_resolution_clock::now();
    status = nc_get_vars(ncid, varid, start, count, stride, outData);
    auto timeend = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::seconds>(timeend - timestart);
    std::cout << "Execution time: " << duration.count() << " seconds" << std::endl;
    printf("status after getting strided subset = %d\n", status);

    // close the file
    status = nc_close(ncid);
    printf("status after close = %d\n", status);
    free(outData);
    printf("End of test.\n\n");
    return 0;
}
```
I would rewrite the code to use vara and see if the speed problem goes away.
You mean use vara to read the values with stride 1 and then do the slicing later?
Use vara and jump around to get the slicing you need, so you are reading the exact same data, but without vars.
Hello Ed, I tried your recommendation. The issue is that with 'nc_get_vara', I'll have to read twice as many elements along each dimension (since in my original case the stride is 2). So instead of 1000 x 256 x 256 elements, I have to read 2000 x 512 x 512 elements.
Even with nc_get_vara, I still find that Windows is significantly slower:
Windows time: 102 seconds
Linux time: 19 seconds
The only change I made to the previous code is to replace

```cpp
status = nc_get_vars(ncid, varid, start, count, stride, outData);
```

with

```cpp
status = nc_get_vara(ncid, varid, start, count, outData);
```

and to enlarge the buffer: `int elems_x = 512; int elems_y = 512; int elems_z = 2000;`
I am taking a look at this to see if I can determine if the slowdown is in libnetcdf, or if it is something in libhdf5.
@abhibaruah a couple questions, if I may, to ensure I'm on the same page.
- When you say Windows, do you mean Visual Studio, or a gcc variant on Windows?
- What version of libhdf5 are you linking against?
Since we're using libhdf5 for file access, my fear is that this is an issue in libhdf5; that may limit our ability to address this. But it's not necessarily the case. I'll start by reproducing the issue, and go from there :).
Thanks @WardF for taking a look.
- Yes, I am using Visual Studio (VS2019v16.11.7)
- I am linking against HDF5 v1.10.10
I recall that this issue was raised some time ago. If memory serves, we proposed converting the vars code to use the corresponding HDF5 operations (I assume we are talking about netcdf-4 and not netcdf-3), but apparently that proposal was never implemented.
Was the proposed change to use the corresponding HDF5 operations only for Windows? Because for my use case Linux time is reasonable (~20 sec) vs (>100 sec) for Windows.
I'm making some progress on this; I haven't narrowed it down to a solution, yet, but I'm able to replicate the observed issue using netCDF v4.9.1 and HDF5 1.10.10. Testing with netCDF main and HDF5 1.14.1, I see performance in line with what's observed in your linux environment. I'm still trying to determine if the culprit is a change in the netCDF code, or if it's a change in the HDF5 code.
@abhibaruah I'm seeing some mostly consistent results; out of curiosity, can you give it a try with v4.9.2?
Hello @WardF, when you say 'consistent' results, do you mean consistent with the slow speeds I saw, or similar to the speed on Linux?
Currently, we do not have v4.9.2 in our harness, and hence it will be difficult for me to build v4.9.2 with HDF5 v1.10.10 (will have to go through legal and administrative hoops for that).
I can download the Windows binaries from here (https://downloads.unidata.ucar.edu/netcdf/) and give it a try but I am guessing that you must have already tried it.
Let me clarify, thanks :). I'm seeing results consistent with what you've described, and I've been able to reproduce them. I'm not certain what the underlying issue is, but I am seeing much faster speeds using netCDF-C v4.9.2: around 45 seconds instead of > 100 (still slightly slower than on Linux, but that could be because of the VM I'm using, etc.).
I'm at a loss as to why this is only happening on Windows, and will continue trying to figure that out. I've tested with HDF5 1.10.10 as well as HDF5 1.14.1; the results are the same when using v4.9.1 (> 100 seconds), and faster when using netCDF v4.9.2 (< 50 seconds), regardless of which version of HDF5 I'm using.
Just a note to follow up, HDF5 1.14.2 is out, I'm going to try to test this on Windows. I understand there are hoops to jump through, but the issue does appear to be related to the underlying HDF5 library.
@WardF I tried the repro with netCDF 4.9.2 and HDF5 1.10.11. Unfortunately, I am still seeing the same performance difference between Windows 11 and Debian 11.
Windows 11: ~130 s
Debian 11: ~11 s
I am not sure why I am still seeing the slowness on Windows. I created an HDF5 script to mimic my repro above (but with an H5 file), and reading the dataset is much faster (~30 s).
@abhibaruah thank you, that is good to know at least, the HDF5 script does suggest it is something in netCDF, although why it would be Windows specific is puzzling. I'll pop this back to the top of the stack and see what I can sort out.
Hello @WardF, hope you are doing well. I tried the repro steps for this issue with netCDF 4.9.2 + HDF5 1.14.4.3 and I could still see the slowdown.
Windows time: ~123 s
Debian 12 time: ~15 s
Let me know if you find any new information regarding the same.
Thanks, Abhi
I also tried the netCDF repro steps with older versions of netCDF. Here are the results (in seconds).

| Version | Windows | Linux |
|---------|---------|-------|
| 4.6.1   | 284.1   | 228   |
| 4.8.1   | 17.8    | 10.51 |
| 4.9.1   | 115.55  | 12.25 |
| 4.9.2   | 140     | 23    |
Looks like the Windows regression was introduced sometime between 4.8.1 and 4.9.1.
Thank you @abhibaruah, that certainly narrows it down, and thanks for bringing this back to the top of the stack; I will see what I can do to dial it in. If I can come up with a test on Windows to replicate this (I should be able to), I can do a git bisect to narrow it down even further. To answer a question I see you asked separately (while I was out of the office on PTO last week): I'm hoping to have rc2 for 4.9.3 out by the end of next week, and then to move forward with the full release barring any feedback that would prevent that. Thanks!