netcdf4-python

Unable to retrieve data readable by Panoply when loaded via memory buffer

thehesiod opened this issue 6 years ago • 24 comments

We've recently started using the MADIS METAR dataset, and for one file, https://madis-data.ncep.noaa.gov/madisPublic1/data/archive/2002/05/19/point/metar/netcdf/20020519_0600.gz, Panoply is able to extract data from several variables (latitude, windGust, etc.). However, when the file is opened with netCDF4-python 1.4.2 (c-lib 4.6.1) on both OSX and Debian, we get the error:

  File "netCDF4/_netCDF4.pyx", line 4119, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 5036, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1754, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: Operation not permitted

testcase:

import requests
import netCDF4
import gzip
import tempfile
import traceback
import os.path

response = requests.get("https://madis-data.ncep.noaa.gov/madisPublic1/data/archive/2002/05/19/point/metar/netcdf/20020519_0600.gz")
cdf_bytes = gzip.decompress(response.content)

print("Loading from Memory")
try:
    with netCDF4.Dataset('filename.nc', memory=cdf_bytes, encoding_errors='strict') as cdf_ds:
        cdf_ds['latitude'][:]
        print('ok')
except Exception:
    print("Failed")
    traceback.print_exc()


print("Loading from File")
with tempfile.TemporaryDirectory() as td:
    cdf_path = os.path.join(td, "file.nc")
    with open(cdf_path, "wb") as f:
        f.write(cdf_bytes)

    try:
        with netCDF4.Dataset(cdf_path, encoding_errors='strict') as cdf_ds:
            cdf_ds['latitude'][:]
            print('ok')
    except Exception:
        print("Failed")
        traceback.print_exc()

thehesiod avatar Dec 20 '18 22:12 thehesiod

Can't reproduce on OSX with github master using

netcdf4-python version: 1.4.2
HDF5 lib version: 1.10.4
netcdf lib version: 4.6.1
numpy version 1.15.0

Can you run the netcdf4-python tests (by running test/run_all.py in the source tarball)?

jswhit avatar Dec 20 '18 23:12 jswhit

with master netCDF4:

amohr@ip-192-168-16-18 ~/Downloads/netCDF4-1.4.2/test $ python3 run_all.py 
not running tst_unicode.py ...
not running tst_cdf5.py ...

netcdf4-python version: 1.4.2
HDF5 lib version:       1.10.4
netcdf lib version:     4.6.1
numpy version           1.14.0
cython version          0.28.5
....................................F.../Users/amohr/Downloads/netCDF4-1.4.2/test/tst_types.py:92: UserWarning: WARNING: missing_value not used since it
cannot be safely cast to variable data type
  assert_array_equal(v3[:],-1*np.ones(n2dim,v3.dtype))
.......................................foo_bar
........................
======================================================================
FAIL: runTest (tst_dap.DapTestCase)
testing access of data over http using opendap
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/amohr/Downloads/netCDF4-1.4.2/test/tst_dap.py", line 26, in runTest
    assert var.shape == varshape
AssertionError

----------------------------------------------------------------------
Ran 103 tests in 19.973s

FAILED (failures=1)

hmm, trying with newer numpy

thehesiod avatar Dec 20 '18 23:12 thehesiod

still fails; this may be related to loading the file as a memory buffer. Let me get you a testcase.

thehesiod avatar Dec 20 '18 23:12 thehesiod

ok, see updated testcase in description

updated title to reflect findings

thehesiod avatar Dec 20 '18 23:12 thehesiod

ugh, I think I'm locally hitting https://github.com/Unidata/netcdf4-python/issues/752, verifying

thehesiod avatar Dec 21 '18 00:12 thehesiod

nope, verified it happens with the old build as well. So this is a new memory load issue. I've updated the testcase to show.

thehesiod avatar Dec 21 '18 02:12 thehesiod

May be related to https://github.com/Unidata/netcdf-c/issues/770?

jswhit avatar Dec 21 '18 03:12 jswhit

don't think so, Dataset.file_format == 'NETCDF3_CLASSIC'

thehesiod avatar Dec 21 '18 03:12 thehesiod

Looks like the data file is malformed (the header says it has more records than it actually has - see https://github.com/Unidata/netcdf-c/issues/1263).
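A quick way to see what the header claims is to read the record count directly. This is only a sketch based on the classic netCDF format: the file begins with the magic bytes "CDF", a one-byte version, and then numrecs as a big-endian 32-bit unsigned integer (0xFFFFFFFF means the count is unknown, i.e. "STREAMING"):

```python
import struct

def header_numrecs(buf: bytes) -> int:
    """Return the record count a classic-format netCDF header claims.

    buf holds the (already decompressed) file bytes. Raises ValueError
    if the magic bytes do not match the classic format.
    """
    if buf[:3] != b"CDF":
        raise ValueError("not a classic-format netCDF file")
    # Byte 3 is the version (1 = classic, 2 = 64-bit offset);
    # bytes 4-8 are numrecs, big-endian unsigned.
    (numrecs,) = struct.unpack(">I", buf[4:8])
    return numrecs  # 0xFFFFFFFF means "numrecs unknown" (STREAMING)
```

For a truncated file like this one, the value returned here will be larger than the number of records actually present in the data section.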

jswhit avatar Jan 03 '19 01:01 jswhit

Is there any way to figure out the genesis of this file: specifically, what software wrote it?

DennisHeimbigner avatar Jan 03 '19 19:01 DennisHeimbigner

Here's the pure python writer: https://github.com/scipy/scipy/blob/master/scipy/io/netcdf.py. I don't see any special metadata that is written to identify the writer, but it might be in there somewhere.

jswhit avatar Jan 03 '19 20:01 jswhit

My questions to the MADIS team as of late have gone unanswered ([email protected]). They would know definitively.

thehesiod avatar Jan 03 '19 20:01 thehesiod

Not surprising they are not answering - they are not supposed to be reading their .gov emails.

jswhit avatar Jan 03 '19 21:01 jswhit

If the data are malformed, that's bad--but it doesn't explain to me why writing to disk and reading it in that way then works.

dopplershift avatar Jan 03 '19 22:01 dopplershift

I do not know for sure that the file is malformed, at least according to the code. There is code in the posixio module that may (it's hard to figure out) allow this case. I am trying to figure out how this file was constructed, to see whether the netcdf-c library actually allows this, though how such a file is created is currently a mystery to me; I can see no way to alter the length of UNLIMITED without writing data.

DennisHeimbigner avatar Jan 03 '19 22:01 DennisHeimbigner

I ran an experiment using this command:

    nccopy 20020519_0600 20020519_0600_2

This should have produced an exact copy of the input file, and comparing the metadata with ncdump confirms that it did. However, the size of 20020519_0600 was 2433024 bytes, while the size of 20020519_0600_2 was 2529252 bytes. In other words, the copy of the file apparently has the correct size as implied by the length of the unlimited dimension. This second file works with nc_open_mem. From this, I conclude two things.

  1. someone truncated the original file, perhaps by removing trailing zeros?
  2. the netcdf-c library is in fact designed to handle such cases.
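The size comparison above can be framed as a quick truncation check. This is a sketch only: `data_start` (the offset where the record section begins) and `recsize` (bytes per record) are assumptions that, for a real file, would have to be computed by parsing the variable list in the header:

```python
def missing_record_bytes(file_size: int, data_start: int,
                         numrecs: int, recsize: int) -> int:
    """How many bytes are missing from the record section.

    The header promises data_start + numrecs * recsize bytes in total;
    anything short of that was truncated (and will be read back as
    zeros or garbage, depending on the I/O path).
    """
    expected = data_start + numrecs * recsize
    return max(0, expected - file_size)
```

With the sizes reported above, 2529252 - 2433024 = 96228 bytes of the record section are missing from the original file.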

DennisHeimbigner avatar Jan 04 '19 02:01 DennisHeimbigner

Russ Rew pointed me to this netcdf FAQ entry: https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#Can-I-recover-data-from-a-netCDF-file-that-was-not-closed-properly

This probably explains how this file came to be constructed: someone forgot to call nc_close() on it.

The question is, however, what to do about this file?

DennisHeimbigner avatar Jan 04 '19 18:01 DennisHeimbigner

This is preventing me from processing existing data, so I hope it gets fixed/supported, perhaps with a warning or something. Either way, the behavior should be consistent no matter which way the file gets opened.
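In the meantime, one workaround is to try the in-memory open first and fall back to spilling the bytes to a temporary file, since the file path is the one that currently tolerates the truncated record section. A minimal sketch, where `open_memory` and `open_path` are hypothetical callables wrapping `netCDF4.Dataset(..., memory=buf)` and `netCDF4.Dataset(path)` respectively (injected here so the fallback logic itself is testable without the library):

```python
import os
import tempfile

def open_with_fallback(buf, open_memory, open_path):
    """Open netCDF bytes, falling back to a temp file on failure.

    open_memory(buf) and open_path(path) are hypothetical wrappers
    around the real netCDF4.Dataset calls.
    """
    try:
        return open_memory(buf)
    except (OSError, RuntimeError):
        # The in-memory open rejects the short record section; a file
        # on disk lets the library zero-fill the missing trailing data.
        fd, path = tempfile.mkstemp(suffix=".nc")
        with os.fdopen(fd, "wb") as f:
            f.write(buf)
        return open_path(path)
```

Note that, per the discussion above, the zero-filled trailing records from the file path are not trustworthy data; this only makes the open succeed consistently.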

thehesiod avatar Jan 04 '19 19:01 thehesiod

The problem you face is that the file is being extended with either zeros or random data. How can you trust this? Is zero extension OK?

DennisHeimbigner avatar Jan 04 '19 20:01 DennisHeimbigner

Note also that because of the layout of the unlimited dimension data, this extension will affect every variable that uses the unlimited dimension.

DennisHeimbigner avatar Jan 04 '19 20:01 DennisHeimbigner

well, opening in Panoply yields the following for latitude: [screenshot: Panoply plot of the latitude variable, 2019-01-04]

thehesiod avatar Jan 04 '19 20:01 thehesiod

The metadata is OK in that file, but the data is incorrect in the sense that some of the values of latitude (approximately those with indices 1890-1969) are either zero or some random value.
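If one still wants to use such a file, the untrustworthy trailing records could be masked out explicitly. A sketch, where `n_valid` is an assumption supplied by the caller (the number of records actually written before truncation, roughly 1890 here):

```python
import numpy as np

def mask_trailing(values, n_valid):
    """Mask everything past the last record known to be valid.

    n_valid is a caller-supplied assumption; the truncated file itself
    cannot tell you which trailing records are zero-filled garbage.
    """
    arr = np.ma.masked_array(values)
    arr[n_valid:] = np.ma.masked
    return arr
```

Downstream computations on the masked array (means, plots, etc.) then ignore the zero/garbage tail instead of silently including it.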

DennisHeimbigner avatar Jan 04 '19 20:01 DennisHeimbigner

I see, thanks; I'll let you guys decide what to do with this case. Looks like I lucked out: since I'm using the mem API I skip that file, but other people are going to get some garbage ;)

thehesiod avatar Jan 04 '19 21:01 thehesiod

another possibility, I guess, is for the file case to throw once it hits EOF

thehesiod avatar Jan 04 '19 21:01 thehesiod