netcdf4-python

Writing to multiple unlimited dimension variables, seg-fault and/or free() invalid next size

Open zerothi opened this issue 4 years ago • 5 comments

I am trying to fill a netCDF file incrementally (each element takes a considerable amount of time to compute, so I would like to compute them when I have the time).

I have this small snippet which seems to reproduce it:

import os
import numpy as np
import netCDF4 as nc

def func(E):
    return [np.random.rand(2, 2) for e in E]

Nk = np.arange(30, 2401)
ETA = np.array([1.e-6, 2.5e-6, 5e-6, 7.5e-6, 1.e-5, 2.5e-5, 5e-5, 7.5e-5, 1.e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3])
E = np.linspace(-3, 3, 300)

# Create a new file containing all calculated Gf's (or reopen an existing one)
if os.path.isfile('TestFile.nc'):
    f = nc.Dataset('TestFile.nc', 'a', format='NETCDF4')
    varnk = f.variables['nk']
    vareta = f.variables['eta']
    varcompleted = f.variables['completed']
    bandgrp = f.groups['band']
    bandG = bandgrp.variables['G']

else:
    f = nc.Dataset('TestFile.nc', 'w', format='NETCDF4')
    f.createDimension("E", None)
    f.createDimension("nk", None)
    f.createDimension("eta", None)
    f.createDimension("no", 2)
    
    varE = f.createVariable("E", "f8", ("E"))
    varE[:] = E[:]
    varnk = f.createVariable("nk", "i4", ("nk"))
    vareta = f.createVariable("eta", "f8", ("eta"))
    varcompleted = f.createVariable("completed", "i4", ("eta", "nk"))

    bandgrp = f.createGroup('band')
    bandG = bandgrp.createVariable("G","f8",("eta", "nk", "E", "no", "no"))

print(f)

def add_value(f, variable, value):
    if variable.shape == (0,):
        variable[0] = value
        return 0
    idx = (variable[:] == value).nonzero()[0]
    if len(idx) == 1:
        return idx[0]
    variable[variable.shape[0]] = value
    return variable.shape[0] - 1

# Loop over all (eta, nk) pairs, skipping the ones that are already completed
for eta in ETA:
    print('Running for eta = {}'.format(eta))

    idxeta = add_value(f, vareta, eta)
    E = E.real + 1j * eta

    for nk in Nk:
        idxnk = add_value(f, varnk, nk)

        if varcompleted[idxeta, idxnk] == 1:
            # we already have it calculated
            continue
        varcompleted[idxeta, idxnk] = 0
        f.sync()

        Gf = func(E)
        bandG[idxeta, idxnk, :, :, :] = [g for g in Gf]

        varcompleted[idxeta, idxnk] = 1
        f.sync()

Sometimes I get a seg-fault, and other times I get free(): invalid next size (fast). Neither is reproducible at a specific index.

My workflow is:

  1. Run script for some time, then kill it (sync should ensure everything is fine)
  2. Re-run script which skips the already calculated elements

It sometimes fails during the initial calculation (i.e. without a restart), and sometimes after a restart. I have tried a gdb run and it gives something like:

#0  0x00001555551917bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000015555517c535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00001555551d3508 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00001555551d9c1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00001555551db4d6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0000155553fd0004 in PyDataMem_FREE (ptr=0x1cb1220) at numpy/core/src/multiarray/alloc.c:264
#6  _npy_free_cache (dealloc=<optimized out>, cache=0x155554789900 <datacache>, msz=1024, 
    nelem=<optimized out>, p=0x1cb1220) at numpy/core/src/multiarray/alloc.c:104
#7  npy_free_cache (p=0x1cb1220, sz=<optimized out>) at numpy/core/src/multiarray/alloc.c:139
#8  0x0000155553fd7c53 in array_dealloc (self=0x15553b652a80)
    at numpy/core/include/numpy/ndarraytypes.h:1490
#9  0x000000000045cae8 in list_dealloc (op=0x15553a9d6dc8) at ../Objects/listobject.c:324
#10 0x0000000000485b67 in insertdict (value=<optimized out>, hash=<optimized out>, 
    key=0x155554a6b7d8, mp=<optimized out>) at ../Objects/dictobject.c:1076
#11 PyDict_SetItem (op=<optimized out>, key=0x155554a6b7d8, value=<optimized out>)
    at ../Objects/dictobject.c:1463
#12 0x000000000042d6bf in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:1935
#13 0x000000000054ccd7 in _PyEval_EvalCodeWithName (_co=_co@entry=0x155554981300, 
    globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240, 
    args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0, kwargs=0x0, 
    kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
    at ../Python/ceval.c:3930
#14 0x000000000054d50e in PyEval_EvalCodeEx (_co=_co@entry=0x155554981300, 
    globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240, 
    args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, kwcount=0, defs=0x0, 
    defcount=0, kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3959
#15 0x000000000054d53b in PyEval_EvalCode (co=co@entry=0x155554981300, 
    globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240)
    at ../Python/ceval.c:524
#16 0x000000000058bd31 in run_mod (arena=0x155554ba2078, flags=0x7fffffff467c, 
    locals=0x155554aab240, globals=0x155554aab240, filename=0x1555549143f8, mod=0x89a628)
    at ../Python/pythonrun.c:1035
#17 PyRun_FileExFlags (fp=fp@entry=0x7a9620, 
    filename_str=filename_str@entry=0x155554908878 "/home/nicpa/articles/rs-se/run_rsGF.py", 
    start=start@entry=257, globals=globals@entry=0x155554aab240, 
    locals=locals@entry=0x155554aab240, closeit=closeit@entry=1, flags=0x7fffffff467c)
    at ../Python/pythonrun.c:988
#18 0x000000000058bec2 in PyRun_SimpleFileExFlags (fp=fp@entry=0x7a9620, 
    filename=<optimized out>, closeit=closeit@entry=1, flags=flags@entry=0x7fffffff467c)
    at ../Python/pythonrun.c:429
#19 0x000000000058c364 in PyRun_AnyFileExFlags (fp=fp@entry=0x7a9620, filename=<optimized out>, 
    closeit=closeit@entry=1, flags=flags@entry=0x7fffffff467c) at ../Python/pythonrun.c:84
#20 0x000000000043a4b0 in pymain_run_file (p_cf=0x7fffffff467c, filename=<optimized out>, 
    fp=0x7a9620) at ../Modules/main.c:427
#21 pymain_run_filename (cf=0x7fffffff467c, pymain=0x7fffffff4750) at ../Modules/main.c:1627
#22 pymain_run_python (pymain=0x7fffffff4750) at ../Modules/main.c:2877
#23 pymain_main (pymain=pymain@entry=0x7fffffff4750) at ../Modules/main.c:3038
#24 0x000000000043a6fe in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>)
    at ../Modules/main.c:3073
#25 0x000015555517e09b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#26 0x0000000000430a9a in _start ()

I don't know how relevant this is.

But perhaps reference counting of a numpy array, or garbage collection in netCDF4 vs. numpy, is triggering this (just a guess).

Aug 15 '19 13:08 zerothi

There are bugs in netcdf-c for writing to variables with multiple unlimited dimensions (see #933, https://github.com/Unidata/netcdf-c/issues/1413). Wouldn't be surprised if this is another one.

Aug 19 '19 01:08 jswhit

Thanks! Feel free to close this, or keep it open until it is fixed upstream if you prefer. :)

Aug 19 '19 06:08 zerothi

Will leave this open - I'm not sure this is a bug upstream. I do wonder if the file is somehow ending up in a corrupted state when you kill the program.

Aug 20 '19 20:08 jswhit

I agree that would be wiser! :) I should probably open/close the file in every iteration. I just did a quick thing ;)

Aug 21 '19 08:08 zerothi

Can we get some C code to reproduce this?

Dec 01 '19 16:12 edhartnett