`read` operations cause too many disk reads with unlimited dimensions
@lesserwhirls - the problem, as far as I can tell, exists in netcdf-java as well.
This is a repeat of this issue, which can probably be closed in favor of this one.
The problem: we're hosting netcdf-classic files in EFS and reading them from EC2, so we're charged per byte read. The charges have been unexpectedly high (an order of magnitude) for reads from our datasets with unlimited dimensions.
I've attached the results of some testing I've done locally on my Mac using fs_usage: the input files, the output of those tests, and the Python script I used to count bytes read. The short summary is that the bytes read from a fixed-dimension dataset are about what we expect with both ncdump and ToolsUI, but way too high when reading from a record dataset.
Updating my analysis to include some info on reading from a netcdf4 dataset with an unlimited dimension:
It works as expected with ncdump! This is great because we have full control over the datasets in question and can switch to nc4 pretty easily to save a lot of money. There are still two problems with that though:
- ncdump makes a read of ~4.2 MB early on in the process, no matter the file (nc3, nc4, fixed dimension, unlimited), which is not inconsequential.
- ToolsUI still over-reads from disk with netcdf4 files, so any datasets we have exposed via THREDDS would still be a liability.
I am going to guess the over-reading by netCDF-Java is due to this?
Maybe, but we're talking about ~150 MB of disk reads to ncdump a 7.9 MB file, which is more than just touching every memory block. ^That's with both Java and C
Interesting; I'm taking a look at this now. I'm in rural Oregon (again), forgive the late response. I'm curious if the same behavior is observed with h5dump; it will be helpful to figure out if this is something in libnetcdf or libhdf5.
Also, thanks for the tip about fs_usage; I was unfamiliar with it.
> Also, thanks for the tip about fs_usage; I was unfamiliar with it.
Honestly ChatGPT and I had some fights before finally nailing down a workflow I was reasonably confident in 😆
If the workflow is portable and easy to share, I wouldn't mind taking a peek and using it to recreate the issue, then seeing what we're able to do about it / testing any potential fixes. :)
It should be portable to a Mac! The quick synopsis is:
- run `fs_usage -w -f filesys` and `>>` the results into a text file
- run the `ncdump` command I'm testing in another window
- end `fs_usage`
- change the output file for `fs_usage` and run again on the next test case
Then, once I had all the logs, I parsed them with a Python regex to find the number of bytes recorded as read by ncdump. I tried with grep and awk, but it was easier for me with Python; I'm sure you'd have better luck there than I did.
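A minimal sketch of that kind of parser (this assumes fs_usage reports each read's byte count as a hex `B=0x...` field and puts the process name at the end of the line; the exact field layout may differ, so adjust the regex to your own logs):

```python
import re
import sys

# Sketch: sum the bytes recorded as read by ncdump in an fs_usage log.
# Assumes read/pread lines carry a hex "B=0x..." byte count and end with
# the process name (e.g. "ncdump.12345"); adjust to your log format.
READ_LINE = re.compile(r"\b(?:read|pread)\b.*?\bB=0x([0-9a-fA-F]+).*\bncdump")

def total_bytes_read(log_path):
    total = 0
    with open(log_path, errors="replace") as log:
        for line in log:
            match = READ_LINE.search(line)
            if match:
                total += int(match.group(1), 16)  # byte counts are hex
    return total

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {total_bytes_read(path):,} bytes read by ncdump")
```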
A couple of things I've learned while staring at these logs for hours:
- recording reads only is easier to parse, but if you record everything you can also see when ncdump opens and closes the actual dataset file, which is also useful.
- With netcdf4, all the data reads are `pread` instead of `read` (which totally makes sense, but that caught me up for a while when I was only searching for reads)
I'll take a look when I get back to Colorado next week, when I have access to a macOS machine with sudo. Thanks!
@haileyajohnson out of curiosity, is the same behavior observed if you try using the NCO tools?
My initial untested thought is that we may be re-reading the metadata any time we need it; this was probably less of an issue when file storage was all local. I've arrived back in Colorado and will sit down, see if I can set up the same workflow, and then start to sort this out. If we can determine whether or not the same thing happens with NCO, it will help narrow down whether this is happening in libnetcdf or in ncdump, and could also offer a stopgap/workaround.
I can check out NCO and get back to you. Your initial untested thought seems on the right track to me; looking at metrics on AWS, I can see the reads split into data reads, which more or less make sense, and metadata reads, which seem large.
My theory for the netcdf3 over-reading (which isn't blocking us, btw) is that reading a single unlimited variable works as intended, in that it reads through the whole record section once, but reading multiple unlimited variables reads through the record section again for each variable. I'll probably self-assign looking into that in netcdf-java, but I doubt I'll get to it this week :)
That sounds like a plausible explanation for netcdf-3.
> I can check out NCO and get back to you. Your initial untested thought seems on the right track to me; looking at metrics on AWS, I can see the reads split into data reads, which more or less make sense, and metadata reads, which seem large.
I would be surprised if that were the case. Ed Hartnett made some significant changes to add lazy metadata evaluation.
Focusing on the netCDF 3 case...
I've finally had a chance to take a deeper look at this. Focusing on netCDF 3, the disk reads appear inflated because the data for each variable with an unlimited dimension will be spread out over the entire file. If you need to read all of the data from each variable one variable at a time (like with ncdump), you will likely end up reading nearly the entire file for each variable with the unlimited dimension.
I found this image particularly helpful: https://www.unidata.ucar.edu/software/netcdf/workshops/2007/performance/FileFormat.html
Speaking for netCDF-Java specifically, we do buffered reads from disk using a default size of 8092 bytes. If I am interpreting the D0513_nowcast.nc file properly, each set of records consumes approximately 630 bytes, so multiple sets of records would be contained in each buffered read. You would need to read those records again for each variable you are dumping data from, so if we estimate 18 variables with an unlimited dimension × ~7 MiB of disk access each, we'd get ~126 MiB, which seems to explain what you see for the netCDF 3 (with an unlimited dimension) case.
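To make that multiplication concrete, a rough back-of-the-envelope using the estimates above (illustrative numbers, not measurements):

```python
# Back-of-the-envelope for the netCDF-3 record-variable case.
record_vars = 18                    # variables with the unlimited dimension
record_section = 7 * 1024 * 1024    # ~7 MiB of interleaved record data
buffer_size = 8092                  # netCDF-Java's default read buffer

# Dumping one record variable at a time walks the interleaved record
# section once per variable, because each buffered read pulls in pieces
# of every record variable's data.
total = record_vars * record_section
print(f"~{total / 2**20:.0f} MiB of disk reads to dump a ~7 MiB file")  # ~126 MiB
```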
@lesserwhirls thanks for taking a look! That's the conclusion I came to as well. It then worried me because I thought several TDS services (like ncss) iterated wantedVariables sequentially, which could pretty quickly balloon into huge read charges, but on a quick glance I think I was wrong about that. Either way, I'm happy to take double-checking that, and fixing any places ncj does iterate when it shouldn't, as my own TODO (after DMAC next week).
Can anyone suggest some form of caching that might help alleviate this problem?
In netCDF-Java, the RemoteRandomAccessFile abstract class (which has concrete implementations for HTTP and S3) has an in-memory read cache that caches the read-buffer-sized requests. The default size of the cache is 10 MiB (configurable), so in the case of the 7-ish MiB file (D0513_nowcast.nc), you would still end up accessing the full file over S3 once, but only once, as all other reads should come out of the cached requests. The read cache could still help for accessing all of the unlimited-dimension variable data in files larger than the read cache, but the benefits would diminish with increasing file size.
I can imagine two different caches:
- cache each record (I think this is the same as the RemoteRandomAccessFile cache; rough sketch below)
- Have n separate caches where n is the number of variables in a record. Reading a new record adds an entry to each of the variable caches.
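For illustration, a minimal sketch of the first option: an LRU cache of buffer-sized reads keyed by file offset, roughly the shape of the RemoteRandomAccessFile read cache described above (the class and parameter names here are made up, not netCDF-Java's actual implementation):

```python
from collections import OrderedDict

class BlockReadCache:
    """Sketch of a block-level LRU read cache; illustrative only."""

    def __init__(self, file_obj, block_size=8092, max_bytes=10 * 1024 * 1024):
        self._file = file_obj                     # any seekable binary file object
        self._block_size = block_size             # netCDF-Java's default buffer size
        self._max_blocks = max(1, max_bytes // block_size)
        self._blocks = OrderedDict()              # block index -> bytes

    def _block(self, index):
        """Return one block, touching the underlying storage only on a miss."""
        if index in self._blocks:
            self._blocks.move_to_end(index)       # mark as most recently used
            return self._blocks[index]
        self._file.seek(index * self._block_size)
        data = self._file.read(self._block_size)  # the only real disk/EFS read
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)      # evict least recently used
        return data

    def read(self, offset, nbytes):
        """Serve an arbitrary (offset, nbytes) request from cached blocks."""
        out = bytearray()
        end = offset + nbytes
        while offset < end:
            index, start = divmod(offset, self._block_size)
            chunk = self._block(index)[start:start + (end - offset)]
            if not chunk:                         # past end of file
                break
            out.extend(chunk)
            offset += len(chunk)
        return bytes(out)
```

As long as the file fits in the cache, dumping each record variable in turn through something like this hits the underlying storage roughly once per block rather than once per block per variable.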