netcdf-c

`read` operations cause too many disk reads with unlimited dimensions

Open • haileyajohnson opened this issue 8 months ago • 20 comments

@lesserwhirls - the problem as far as I can tell exists in netcdf-java as well.

This is a repeat of an earlier issue, which can probably be closed in favor of this one.

The problem: we're hosting netcdf-classic files in EFS and reading them from EC2, so we're charged per byte read. The charges have been unexpectedly high (an order of magnitude) for reads from our datasets with unlimited dimensions.

I've attached the results of some testing I've done locally on my Mac using fs_usage: the input files, the output of those tests, and the Python script I used to count bytes read. The short summary is that the reads from a fixed-dimension dataset are about what we expect with both ncdump and ToolsUI, but way too high when reading from a record dataset.

test_disk_reads.zip

haileyajohnson avatar Apr 15 '25 19:04 haileyajohnson

Updating my analysis to include some info on reading from a netcdf4 dataset with an unlimited dimension:

It works as expected with ncdump! This is great because we have full control over the datasets in question and can switch to nc4 pretty easily to save a lot of money. There are still two problems with that though:

  1. ncdump makes a read of ~4.2 MB early on in the process, no matter the file (nc3, nc4, fixed dimension, unlimited), which is not inconsequential.
  2. ToolsUI still does a bad job of over-reading from disk with netcdf4 files, so any datasets we have exposed via THREDDS would still be a liability.

haileyajohnson avatar Apr 15 '25 23:04 haileyajohnson

I am going to guess the over-reading by netCDF-Java is due to this?

lesserwhirls avatar Apr 16 '25 17:04 lesserwhirls

Maybe, but we're talking about ~150 MB of disk reads to ncdump a 7.9 MB file, which is more than just touching every block once. That's with both Java and C.

haileyajohnson avatar Apr 16 '25 18:04 haileyajohnson

Interesting; I'm taking a look at this now. I'm in rural Oregon (again), forgive the late response. I'm curious if the same behavior is observed with h5dump; it will be helpful to figure out if this is something in libnetcdf or libhdf5.

WardF avatar Apr 16 '25 18:04 WardF

Also, thanks for the tip about fs_usage; I was unfamiliar with it.

WardF avatar Apr 16 '25 18:04 WardF

> Also, thanks for the tip about fs_usage; I was unfamiliar with it.

Honestly ChatGPT and I had some fights before finally nailing down a workflow I was reasonably confident in 😆

haileyajohnson avatar Apr 16 '25 19:04 haileyajohnson

If the workflow is portable and easy to share, I wouldn't mind taking a peek and using it to recreate the issue, then seeing what we're able to do about it and testing any potential fixes. :)

WardF avatar Apr 16 '25 19:04 WardF

It should be portable to a mac! The quick synopsis is:

  • run `fs_usage -w -f filesys` and >> the results into a text file
  • run the ncdump command I'm testing in another window
  • end fs_usage
  • change the output file for fs_usage and run again on the next test case

Then, once I had all the logs, I parsed them with Python regex to find the number of bytes recorded as read by ncdump. I tried with grep and awk, but it was easier for me with Python; I'm sure you'd have better luck there than I did.
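For context, a minimal sketch of this kind of parsing pass (not the exact script in test_disk_reads.zip; it assumes fs_usage reports byte counts in a `B=0x<hex>` field and that the process name appears on each line, which may vary across macOS versions):

```python
# Sketch only: sum the bytes fs_usage attributes to ncdump read/pread calls.
# Assumes byte counts appear as "B=0x<hex>" and "ncdump" appears on the line.
import re
import sys

LINE_RE = re.compile(r"\b(read|pread)\b.*?B=0x([0-9a-fA-F]+).*?ncdump")

def total_bytes_read(log_path):
    total = 0
    with open(log_path, errors="replace") as log:
        for line in log:
            match = LINE_RE.search(line)
            if match:
                total += int(match.group(2), 16)
    return total

if __name__ == "__main__":
    print(f"{total_bytes_read(sys.argv[1])} bytes read")
```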

A couple of things I've learned while staring at these logs for hours:

  • recording reads only is easier to parse, but if you record everything you can also see when ncdump opens and closes the actual dataset file, which is also useful.
  • With netcdf4, all the data reads are pread instead of read (which totally makes sense, but it tripped me up for a while when I was only searching for reads)

haileyajohnson avatar Apr 16 '25 19:04 haileyajohnson

I'll take a look when I get back to Colorado next week, when I have access to a MacOS machine with sudo. Thanks!

WardF avatar Apr 17 '25 17:04 WardF

@haileyajohnson out of curiosity, is the same behavior observed if you try using the NCO tools?

WardF avatar Apr 21 '25 16:04 WardF

My initial untested thought is that we may be re-reading the metadata any time we need it; this was probably less of an issue when file storage was all local. I've arrived back in Colorado and will sit down, see if I can set up the same workflow, and then start to sort this out. If we are able to determine whether or not this is happening in NCO, it will help narrow down whether this is happening in libnetcdf or in ncdump, and also potentially offer a stopgap/workaround.

WardF avatar Apr 21 '25 16:04 WardF

I can check out NCO and get back to you. Your initial untested thought seems on the right track to me; looking at metrics on AWS, I can see the reads split into data reads, which more or less make sense, and metadata reads, which seem large.

haileyajohnson avatar Apr 21 '25 16:04 haileyajohnson

My theory for the netcdf-3 over-reading (which isn't blocking us, btw) is that reading a single unlimited variable works as intended in that it reads the whole record section, but reading multiple unlimited variables reads that record section again for each variable. I'll probably self-assign looking into that in netcdf-java, but I doubt I'll get to it this week :)

haileyajohnson avatar Apr 21 '25 16:04 haileyajohnson

That sounds like a plausible explanation for netcdf-3.

DennisHeimbigner avatar Apr 21 '25 21:04 DennisHeimbigner

> I can check out NCO and get back to you. Your initial untested thought seems on the right track to me; looking at metrics on AWS, I can see the reads split into data reads, which more or less make sense, and metadata reads, which seem large.

I would be surprised if this were the case. Ed Hartnett made some significant changes to add lazy metadata evaluation.

DennisHeimbigner avatar Apr 21 '25 21:04 DennisHeimbigner

Focusing on the netCDF 3 case...

I've finally had a chance to take a deeper look at this. Focusing on netCDF 3, the disk reads appear inflated because the data for each variable with an unlimited dimension is spread out over the entire file. If you need to read all of the data one variable at a time (as ncdump does), you will likely end up reading nearly the entire file once for each variable with the unlimited dimension.

I found this image particularly helpful:

[file format diagram from https://www.unidata.ucar.edu/software/netcdf/workshops/2007/performance/FileFormat.html]
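To spell out what that diagram shows, here is an illustrative sketch (made-up offsets, not values from D0513_nowcast.nc) of how one value of a record variable is located in the classic format; because record variables are interleaved record by record, a single variable's data is strided across the whole record section:

```python
# Illustrative sketch of the classic-format record layout: each record holds
# one slab of every record variable, so a single variable's data is strided
# across the entire record section.
def record_value_offset(record_section_start, record_size, var_offset_in_record, record_index):
    """Byte offset of record `record_index` of one record variable."""
    return record_section_start + record_index * record_size + var_offset_in_record

# Hypothetical numbers: a record section starting at byte 4096, ~630 bytes per
# record, and a variable stored 100 bytes into each record.
for rec in range(3):
    print(record_value_offset(4096, 630, 100, rec))   # 4196, 4826, 5456
```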

Speaking for netCDF-Java specifically, we do buffered reads from disk using a default size of 8092 bytes. If I am interpreting the D0513_nowcast.nc file properly, each record (one slab across all of the record variables) will consume approximately 630 bytes, so multiple records would be contained in each buffered read. You would need to read those records again for each variable you are dumping data from, so if we estimate 18 variables with an unlimited dimension × ~7 MiB of disk access for each, we'd get ~126 MiB, which seems to explain what you see for the netCDF 3 (with an unlimited dimension) case.
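As a back-of-envelope check on that estimate (numbers copied from the paragraph above, all approximate):

```python
# Rough read-amplification estimate for dumping record variables one at a time.
record_vars = 18          # variables that use the unlimited dimension
record_section_mib = 7.0  # approximate disk access per full pass over the file
print(f"~{record_vars * record_section_mib:.0f} MiB of reads")   # ~126 MiB
```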

lesserwhirls avatar Apr 23 '25 19:04 lesserwhirls

@lesserwhirls thanks for taking a look! That's the conclusion I came to as well, which then worried me because I thought several TDS services (like ncss) iterated wantedVariables sequentially, which could pretty quickly balloon into huge read charges; on a quick glance, though, I think I was wrong about that. Either way, I'm happy to make double-checking that, and fixing any places ncj does iterate and shouldn't, my own TODO (after DMAC next week).

haileyajohnson avatar Apr 23 '25 20:04 haileyajohnson

Can anyone suggest some form of caching that might help alleviate this problem?

DennisHeimbigner avatar Apr 24 '25 19:04 DennisHeimbigner

In netCDF-Java, the RemoteRandomAccessFile abstract class (which has concrete implementations for HTTP and S3) has an in-memory read cache, which caches the read-buffer-sized requests. The default size of the cache is 10 MiB (configurable), so in the case of the 7-ish MiB file (D0513_nowcast.nc), you would still end up accessing the full file over S3 once, but only once, as all other reads should come out of the cached requests. The read cache could still help for accessing all of the unlimited-dimension variable data in files larger than the read cache, but the benefits would diminish with increasing size.
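Not the actual RemoteRandomAccessFile code, but a minimal sketch of a block-aligned LRU read cache of this general shape (the block size, capacity, and read_block callback are illustrative assumptions):

```python
# Sketch of a block-aligned LRU read cache: reads are served from cached
# fixed-size blocks; a block is fetched from the backing store only on a miss.
from collections import OrderedDict

class BlockReadCache:
    def __init__(self, read_block, block_size=8092, capacity_bytes=10 * 1024 * 1024):
        self.read_block = read_block          # callable: block_index -> bytes
        self.block_size = block_size
        self.max_blocks = capacity_bytes // block_size
        self.blocks = OrderedDict()           # block_index -> bytes, in LRU order

    def read(self, offset, nbytes):
        out = bytearray()
        first = offset // self.block_size
        last = (offset + nbytes - 1) // self.block_size
        for index in range(first, last + 1):
            block = self.blocks.get(index)
            if block is None:
                block = self.read_block(index)          # miss: hit the backing store
                self.blocks[index] = block
                if len(self.blocks) > self.max_blocks:  # evict least recently used
                    self.blocks.popitem(last=False)
            else:
                self.blocks.move_to_end(index)          # refresh LRU position
            out += block
        start = offset - first * self.block_size
        return bytes(out[start:start + nbytes])
```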

lesserwhirls avatar Apr 25 '25 13:04 lesserwhirls

I can imagine two different caches:

  1. cache each record (I think this is the same as the RemoteRandomAccessFile cache)
  2. Have n separate caches, where n is the number of variables in a record. Reading a new record adds an entry to each of the variable caches (see the sketch below for one reading of this).
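One hypothetical reading of option 2, sketched below (the variable layout and sizes are made up for illustration):

```python
# Sketch of idea 2: one cache per record variable. When a record is read from
# disk, its bytes are sliced up and filed under each variable's own cache, so a
# later variable-by-variable dump never has to reread the same record.
class PerVariableRecordCache:
    def __init__(self, var_layout):
        # var_layout: {var_name: (offset_within_record, nbytes)} for record variables
        self.var_layout = var_layout
        self.caches = {name: {} for name in var_layout}   # var -> {record_index: bytes}

    def add_record(self, record_index, record_bytes):
        for name, (offset, nbytes) in self.var_layout.items():
            self.caches[name][record_index] = record_bytes[offset:offset + nbytes]

    def get(self, var_name, record_index):
        return self.caches[var_name].get(record_index)

# Example with a made-up two-variable record layout (4 bytes each):
cache = PerVariableRecordCache({"temp": (0, 4), "salinity": (4, 4)})
cache.add_record(0, b"\x00\x01\x02\x03\x10\x11\x12\x13")
print(cache.get("salinity", 0))   # b'\x10\x11\x12\x13'
```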

DennisHeimbigner avatar Apr 25 '25 21:04 DennisHeimbigner