netcdf4-python
Unable to retrieve data readable by Panoply when loaded via memory buffer
We've recently started using the MADIS METAR dataset, and for one file, https://madis-data.ncep.noaa.gov/madisPublic1/data/archive/2002/05/19/point/metar/netcdf/20020519_0600.gz, Panoply is able to extract data from several variables (latitude, windGust, etc.). However, when the file is opened with netCDF4 1.4.2 (C lib 4.6.1) on both OSX and Debian, we get this error:
File "netCDF4/_netCDF4.pyx", line 4119, in netCDF4._netCDF4.Variable.__getitem__
File "netCDF4/_netCDF4.pyx", line 5036, in netCDF4._netCDF4.Variable._get
File "netCDF4/_netCDF4.pyx", line 1754, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: Operation not permitted
testcase:
import requests
import netCDF4
import gzip
import tempfile
import traceback
import os.path

response = requests.get("https://madis-data.ncep.noaa.gov/madisPublic1/data/archive/2002/05/19/point/metar/netcdf/20020519_0600.gz")
cdf_bytes = gzip.decompress(response.content)

print("Loading from Memory")
try:
    with netCDF4.Dataset('filename.nc', memory=cdf_bytes, encoding_errors='strict') as cdf_ds:
        cdf_ds['latitude'][:]
    print('ok')
except:
    print("Failed")
    traceback.print_exc()

print("Loading from File")
with tempfile.TemporaryDirectory() as td:
    cdf_path = os.path.join(td, "file.nc")
    with open(cdf_path, "wb") as f:
        f.write(cdf_bytes)
    try:
        with netCDF4.Dataset(cdf_path, encoding_errors='strict') as cdf_ds:
            cdf_ds['latitude'][:]
        print('ok')
    except:
        print("Failed")
        traceback.print_exc()
Can't reproduce on OSX with GitHub master using:
netcdf4-python version: 1.4.2
HDF5 lib version: 1.10.4
netcdf lib version: 4.6.1
numpy version: 1.15.0
Can you run the netcdf4-python tests (by running test/run_all.py in the source tarball)?
with master netCDF4:
amohr@ip-192-168-16-18 ~/Downloads/netCDF4-1.4.2/test $ python3 run_all.py
not running tst_unicode.py ...
not running tst_cdf5.py ...
netcdf4-python version: 1.4.2
HDF5 lib version: 1.10.4
netcdf lib version: 4.6.1
numpy version 1.14.0
cython version 0.28.5
....................................F.../Users/amohr/Downloads/netCDF4-1.4.2/test/tst_types.py:92: UserWarning: WARNING: missing_value not used since it
cannot be safely cast to variable data type
assert_array_equal(v3[:],-1*np.ones(n2dim,v3.dtype))
.......................................foo_bar
........................
======================================================================
FAIL: runTest (tst_dap.DapTestCase)
testing access of data over http using opendap
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/amohr/Downloads/netCDF4-1.4.2/test/tst_dap.py", line 26, in runTest
assert var.shape == varshape
AssertionError
----------------------------------------------------------------------
Ran 103 tests in 19.973s
FAILED (failures=1)
hmm, trying with newer numpy
still fails; this may be related to loading the file as a memory buffer, let me get you a testcase
ok, see updated testcase in description
updated title to reflect findings
ugh, I think I'm locally hitting https://github.com/Unidata/netcdf4-python/issues/752, verifying
nope, verified it happens with the old build as well. So this is a new memory-load issue. I've updated the testcase to show this.
May be related to https://github.com/Unidata/netcdf-c/issues/770 ?
don't think so, Dataset.file_format == 'NETCDF3_CLASSIC'
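For reference, a quick way to check that with the in-memory buffer from the testcase above (a sketch reusing cdf_bytes; the open itself succeeds, it's the later read that fails):

import netCDF4

# file_format reports the underlying storage format of the dataset.
with netCDF4.Dataset("inmemory.nc", memory=cdf_bytes) as ds:
    print(ds.file_format)  # -> 'NETCDF3_CLASSIC'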
Looks like the data file is malformed (the header says it has more records than it actually has - see https://github.com/Unidata/netcdf-c/issues/1263).
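For the curious, here is a minimal sketch, based on the classic CDF-1 header layout, of reading the record count the header claims; it reuses cdf_bytes from the testcase above. A count larger than the records actually stored is the malformation described in netcdf-c#1263:

import struct

magic, version = cdf_bytes[:3], cdf_bytes[3]
assert magic == b"CDF" and version == 1           # classic (CDF-1) file
numrecs = struct.unpack(">I", cdf_bytes[4:8])[0]  # big-endian 4-byte count
print("numrecs claimed by header:", numrecs)      # 0xFFFFFFFF means STREAMING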
Is there any way to figure out the genesis of this file, specifically what software wrote it?
Here's the pure Python writer: https://github.com/scipy/scipy/blob/master/scipy/io/netcdf.py. I don't see any special metadata written to identify the writer, but it might be in there somewhere.
My questions to the MADIS team as of late have gone unanswered ([email protected]). They would know definitively.
Not surprising they are not answering - they are not supposed to be reading their .gov emails.
If the data are malformed, that's bad, but it doesn't explain to me why writing to disk and reading it that way then works.
I do not know for sure that the file is malformed, at least according to the code. There is code in the posixio module that may (it's hard to figure out) allow this case. I am trying to figure out how this file was constructed, to see whether the netcdf-c library actually allows this, though how such a file is created is currently a mystery to me; I can see no way to alter the length of UNLIMITED without writing data.
I ran an experiment using this command: nccopy 20020519_0600 20020519_0600_2. This should have produced an exact copy of the input file, and using ncdump to compare the metadata, it did. However, the size of 20020519_0600 was 2433024 bytes while the size of 20020519_0600_2 was 2529252 bytes. In other words, the copy of the file has, apparently, the correct size as indicated by the length of unlimited. This second file works with nc_open_mem (a scripted version of this repair is sketched after the list below). From this, I conclude two things:
- someone truncated the original file, perhaps by removing trailing zeros?
- the netcdf-c library is in fact designed to handle such cases.
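Here is that scripted version of the experiment, as a sketch: it assumes the nccopy utility from netcdf-c is on PATH and that the decompressed bytes have been written to 20020519_0600 as in the testcase:

import subprocess

# nccopy rewrites the file through the netcdf-c library, which pads the
# record data out to the length implied by the unlimited dimension.
subprocess.run(["nccopy", "20020519_0600", "20020519_0600_2"], check=True)
# The repaired copy then opens cleanly via nc_open_mem / Dataset(memory=...).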
Russ Rew pointed me to this netCDF FAQ entry: https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#Can-I-recover-data-from-a-netCDF-file-that-was-not-closed-properly
This probably explains how this file came to be constructed: someone forgot to call nc_close() on it.
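A minimal illustration of that failure mode, as a sketch with hypothetical filenames: in the classic format the record count in the header is only guaranteed to agree with the data after nc_sync/nc_close, so a writer that dies before closing can leave the two out of step:

import netCDF4

ds = netCDF4.Dataset("out.nc", "w", format="NETCDF3_CLASSIC")
ds.createDimension("recNum", None)        # unlimited dimension
lat = ds.createVariable("latitude", "f4", ("recNum",))
lat[:100] = 42.0                          # grows the record count
# Exiting here without ds.close() can leave the header's record count
# and the record data on disk in disagreement; a 'with' block avoids that.
ds.close()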
The question is, however, what to do about this file?
This is preventing me from processing existing data, so I hope it gets fixed/supported, perhaps with a warning or something. Either way, the behavior should be consistent no matter which way the file is opened.
The problem you face is that the file is being extended with either zeros or random data. How can you trust this? Is zero extension OK?
Note also that, because record variables in the classic format are interleaved record by record on disk, this extension will affect every variable that uses the unlimited dimension.
well, opening in Panoply yields the following for latitude:
[Panoply screenshot of the latitude values]
The metadata is OK in that file, but the data is incorrect in the sense that some of the latitude values (approximately those with indexes 1890-1969) are either zero or some random value.
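A sketch of checking that region against the on-disk copy from the testcase (file.nc is the temp-file path; the 1890-1969 range is from the comment above):

import numpy as np
import netCDF4

with netCDF4.Dataset("file.nc") as ds:
    lat = np.asarray(ds["latitude"][:])
print(np.nonzero(lat == 0.0)[0])  # indexes of the zero-filled values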
I see, thanks. I'll let you guys decide what to do with this case. Looks like I lucked out in that I'm using the mem API, so I skip that file; other people are going to get some garbage ;)
Another possibility, I guess, is for the file case to throw once it hits EOF.