netcdf4-python
Writing to multiple unlimited dimension variables, seg-fault and/or free() invalid next size
I am trying to do incremental work in a NetCDF file (each element takes a considerable amount of time to compute, so I would like to compute them whenever I have the time).
This small snippet seems to reproduce the problem:
import os
import numpy as np
import netCDF4 as nc

def func(E):
    return [np.random.rand(2, 2) for e in E]

Nk = np.arange(30, 2401)
ETA = np.array([1.e-6, 2.5e-6, 5e-6, 7.5e-6, 1.e-5, 2.5e-5, 5e-5, 7.5e-5, 1.e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3])
E = np.linspace(-3, 3, 300)

# Create a new file containing all Gf's calculate
if os.path.isfile('TestFile.nc'):
    f = nc.Dataset('TestFile.nc', 'a', format='NETCDF4')
    varnk = f.variables['nk']
    vareta = f.variables['eta']
    varcompleted = f.variables['completed']
    bandgrp = f.groups['band']
    bandG = bandgrp.variables['G']
else:
    f = nc.Dataset('TestFile.nc', 'w', format='NETCDF4')
    f.createDimension("E", None)
    f.createDimension("nk", None)
    f.createDimension("eta", None)
    f.createDimension("no", 2)
    varE = f.createVariable("E", "f8", ("E"))
    varE[:] = E[:]
    varnk = f.createVariable("nk", "i4", ("nk"))
    vareta = f.createVariable("eta", "f8", ("eta"))
    varcompleted = f.createVariable("completed", "i4", ("eta", "nk"))
    bandgrp = f.createGroup('band')
    bandG = bandgrp.createVariable("G","f8",("eta", "nk", "E", "no", "no"))
print(f)

def add_value(f, variable, value):
    if variable.shape == (0,):
        variable[0] = value
        return 0
    idx = (variable[:] == value).nonzero()[0]
    if len(idx) == 1:
        return idx[0]
    variable[variable.shape[0]] = value
    return variable.shape[0] - 1

# Now perform timing and calculate maximum differences between the two methods
for eta in ETA:
    print('Running for eta = {}'.format(eta))
    idxeta = add_value(f, vareta, eta)
    E = E.real + 1j * eta
    for nk in Nk:
        idxnk = add_value(f, varnk, nk)
        if varcompleted[idxeta, idxnk] == 1:
            # we already have it calculated
            continue
        varcompleted[idxeta, idxnk] = 0
        f.sync()
        Gf = func(E)
        bandG[idxeta, idxnk, :, :, :] = [g for g in Gf]
        varcompleted[idxeta, idxnk] = 1
        f.sync()
Sometimes I get a segfault, and other times I get free(): invalid next size (fast). Neither is exactly reproducible at a specific index.
My workflow is:
- Run the script for some time, then kill it (sync should ensure everything is fine)
- Re-run the script, which skips the already calculated elements
It sometimes fails during the initial calculation (i.e. without a restart), and sometimes after a restart.
I have tried a gdb run and it gives something like:
#0 0x00001555551917bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000015555517c535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00001555551d3508 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00001555551d9c1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00001555551db4d6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000155553fd0004 in PyDataMem_FREE (ptr=0x1cb1220) at numpy/core/src/multiarray/alloc.c:264
#6 _npy_free_cache (dealloc=<optimized out>, cache=0x155554789900 <datacache>, msz=1024,
nelem=<optimized out>, p=0x1cb1220) at numpy/core/src/multiarray/alloc.c:104
#7 npy_free_cache (p=0x1cb1220, sz=<optimized out>) at numpy/core/src/multiarray/alloc.c:139
#8 0x0000155553fd7c53 in array_dealloc (self=0x15553b652a80)
at numpy/core/include/numpy/ndarraytypes.h:1490
#9 0x000000000045cae8 in list_dealloc (op=0x15553a9d6dc8) at ../Objects/listobject.c:324
#10 0x0000000000485b67 in insertdict (value=<optimized out>, hash=<optimized out>,
key=0x155554a6b7d8, mp=<optimized out>) at ../Objects/dictobject.c:1076
#11 PyDict_SetItem (op=<optimized out>, key=0x155554a6b7d8, value=<optimized out>)
at ../Objects/dictobject.c:1463
#12 0x000000000042d6bf in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
at ../Python/ceval.c:1935
#13 0x000000000054ccd7 in _PyEval_EvalCodeWithName (_co=_co@entry=0x155554981300,
globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240,
args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0, kwargs=0x0,
kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
at ../Python/ceval.c:3930
#14 0x000000000054d50e in PyEval_EvalCodeEx (_co=_co@entry=0x155554981300,
globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240,
args=args@entry=0x0, argcount=argcount@entry=0, kws=kws@entry=0x0, kwcount=0, defs=0x0,
defcount=0, kwdefs=0x0, closure=0x0) at ../Python/ceval.c:3959
#15 0x000000000054d53b in PyEval_EvalCode (co=co@entry=0x155554981300,
globals=globals@entry=0x155554aab240, locals=locals@entry=0x155554aab240)
at ../Python/ceval.c:524
#16 0x000000000058bd31 in run_mod (arena=0x155554ba2078, flags=0x7fffffff467c,
locals=0x155554aab240, globals=0x155554aab240, filename=0x1555549143f8, mod=0x89a628)
at ../Python/pythonrun.c:1035
#17 PyRun_FileExFlags (fp=fp@entry=0x7a9620,
filename_str=filename_str@entry=0x155554908878 "/home/nicpa/articles/rs-se/run_rsGF.py",
start=start@entry=257, globals=globals@entry=0x155554aab240,
locals=locals@entry=0x155554aab240, closeit=closeit@entry=1, flags=0x7fffffff467c)
at ../Python/pythonrun.c:988
#18 0x000000000058bec2 in PyRun_SimpleFileExFlags (fp=fp@entry=0x7a9620,
filename=<optimized out>, closeit=closeit@entry=1, flags=flags@entry=0x7fffffff467c)
at ../Python/pythonrun.c:429
#19 0x000000000058c364 in PyRun_AnyFileExFlags (fp=fp@entry=0x7a9620, filename=<optimized out>,
closeit=closeit@entry=1, flags=flags@entry=0x7fffffff467c) at ../Python/pythonrun.c:84
#20 0x000000000043a4b0 in pymain_run_file (p_cf=0x7fffffff467c, filename=<optimized out>,
fp=0x7a9620) at ../Modules/main.c:427
#21 pymain_run_filename (cf=0x7fffffff467c, pymain=0x7fffffff4750) at ../Modules/main.c:1627
#22 pymain_run_python (pymain=0x7fffffff4750) at ../Modules/main.c:2877
#23 pymain_main (pymain=pymain@entry=0x7fffffff4750) at ../Modules/main.c:3038
#24 0x000000000043a6fe in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>)
at ../Modules/main.c:3073
#25 0x000015555517e09b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#26 0x0000000000430a9a in _start ()
I don't know how relevant this is.
But perhaps reference counting of a numpy array and garbage collection in the netCDF file vs. numpy is triggering this (just a guess).
There are bugs in netcdf-c for writing to variables with multiple unlimited dimensions (see #933, https://github.com/Unidata/netcdf-c/issues/1413). Wouldn't be surprised if this is another one.
Thanks! Feel free to close this, or keep it open until it is fixed upstream if you prefer. :)
Will leave this open - I'm not sure this is a bug upstream. I do wonder if the file is somehow ending up in a corrupted state when you kill the program.
I agree that would be wiser! :) I should probably open/close the file in every iteration. I just did a quick thing ;)
Can we get some C code to reproduce this?