netcdf4-python icon indicating copy to clipboard operation
netcdf4-python copied to clipboard

segfault in filepath Dataset method on some platforms

Open jswhit opened this issue 9 years ago • 9 comments

reported by @ocefpaf, see https://github.com/conda-forge/staged-recipes/pull/43

Can be triggered (on some platforms) with

import netCDF4
url = 'http://test.opendap.org:80/opendap/data/ncml/sample_virtual_dataset.ncml'
nc = netCDF4.Dataset(url)
assert nc.filepath() == url  # segfault here with version 1.2.2

Reverting filepath to the version in 1.2.1 with this patch fixes it

--- a/netCDF4/_netCDF4.pyx
+++ b/netCDF4/_netCDF4.pyx
@@ -1854,12 +1854,9 @@ Get the file system path (or the opendap URL) which was used to
 open/create the Dataset. Requires netcdf >= 4.1.2"""
         cdef int ierr
         cdef size_t pathlen
-        cdef char *path
+        cdef char path[NC_MAX_NAME + 1]
         IF HAS_NC_INQ_PATH:
             with nogil:
-                ierr = nc_inq_path(self._grpid, &pathlen, NULL)
-            path = <char *>malloc(sizeof(char) * pathlen)
-            with nogil:
                 ierr = nc_inq_path(self._grpid, &pathlen, path)
             return path.decode('ascii')
         ELSE:

However, I would rather not to do this since it will truncate paths longer than NC_MAX_NAME+1.

jswhit avatar Jan 08 '16 18:01 jswhit

@jswhit bare in mind that the issue surfaces under a very specific condition: packaging with conda using while using an old GLIBC (CentOS6). I cannot reproduce that problem in a modern Linux distro (recent GLIBC).

(The reason why we build conda packages with an old CentOS is to ensure compatibility across all GLIBC versions.)

ocefpaf avatar Jan 08 '16 18:01 ocefpaf

@ocefpaf, would you mind printing out the value of pathlen, right after the first call to nc_inq_path?

jswhit avatar Jan 08 '16 19:01 jswhit

pathlen seems OK 72

('http://test.opendap.org:80/opendap/data/ncml/sample_virtual_dataset.ncml' does have 72 characters.)

ocefpaf avatar Jan 08 '16 19:01 ocefpaf

@ocefpaf: I've updated netCDF4.pxi and _netCDF4.pyx to use malloc and free imported from cython (which I believe allocates memory on the python heap) instead of using the libc version directly (pull request #507). I've also added some exception handling to allow it to fail more gracefully if it can't allocate the memory for the filepath pointer. Hopefully this helps.

jswhit avatar Jan 09 '16 14:01 jswhit

according to @ocefpaf, the segfault still occurs on CentOS6. I'm out of ideas at this point.

jswhit avatar Jan 12 '16 13:01 jswhit

Sorry I cannot by of more help. I do not know the first thing about C and cython.

Here are the Linux build logs:

https://circleci.com/gh/conda-forge/staged-recipes/151

@pelson any ideas?

ocefpaf avatar Jan 12 '16 13:01 ocefpaf

@pelson any ideas?

Might make sense to be able to reproduce quickly with just a docker image and a few commands...

There is nothing worse than not being able to recreate an issue yourself, and it makes sense to be able to use pdb etc to track things down.

pelson avatar Jan 12 '16 15:01 pelson

I'm going to go ahead and merge the pull request, since it should prevent memory leakage, even if it doesn't fix the segfaults. Seems likely that the segfault is coming from either a bug in the netcdf c-lib, or the old glibc.

jswhit avatar Jan 15 '16 03:01 jswhit

I was putting together a Dockerfile to run the tests using CircleCI and I noticed that:

  • the test added in https://github.com/Unidata/netcdf4-python/commit/4f7535fe5feb07708e77d189e0e6cfa0a7f14286 passes just fine;
  • the URL I used before is sill failing;
  • when changing URLs the nc.filepath() test passes. (See https://github.com/conda-forge/staged-recipes/pull/43).

I have no idea what is going on! Also, at least so far, the problem seems to be relate only to the URL I used for testing.

Thanks @jswhit for looking into this. I believe it is not worth try to fix this unless we find more failing cases.

ocefpaf avatar Jan 18 '16 19:01 ocefpaf