# ak.sum segfault: backport v2 solution into v1
### Version of Awkward Array

1.7.0

### Description and code to reproduce
```python
import awkward as ak
import numpy as np

v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v

# frequently (not always) segfaults with axis=0 or 1
np.sum(v_times_M_times_v, axis=0)

array = ak.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])

# always(?) segfaults with axis=0 or 1, also when using np.sum
ak.sum(array, axis=1)
```
The above code causes frequent segmentation faults in the first `np.sum` call, and always (at least in ~100 attempts) segfaults in the second `ak.sum` call (the same happens when replacing it with `np.sum`).
I can reproduce this with two setups:

- Python 3.9.10 on M1 macOS with `awkward` 1.7.0 and `numpy` 1.22.2
- Python 3.8.10 on Intel macOS with `awkward` 1.7.0 and `numpy` 1.22.1
I cannot reproduce the segmentation fault when using the following:

- `python:3.9-slim` container with `awkward` 1.7.0 and `numpy` 1.22.2
In case it is useful, I rarely get the following output as well:

```
python(14372,0x102a64580) malloc: Heap corruption detected, free list is damaged at 0x600000358260
*** Incorrect guard value: 8
python(14372,0x102a64580) malloc: *** set a breakpoint in malloc_error_break to debug
```
In the original setup where I came across this (which uses a larger array as input), I also see the following:

```
python(12981,0x105568580) malloc: *** error for object 0x3d: pointer being freed was not allocated
python(12981,0x105568580) malloc: *** set a breakpoint in malloc_error_break to debug
```
The original setup also crashes less frequently (though I did not measure this very precisely).

I am not sure how to approach debugging this, but am happy to help given some pointers for what to do. I have also not yet tried other combinations of `numpy` and `awkward`, or systematically investigated which kinds of arrays work and which cause crashes.
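For that kind of systematic investigation, here is a minimal harness sketch (my own code, not part of awkward; the `crashes` helper and `SNIPPET` are hypothetical names) that runs each candidate reduction in a subprocess, so a segfault kills the child process rather than the survey loop:

```python
import subprocess
import sys
import textwrap

SNIPPET = textwrap.dedent("""
    import awkward as ak
    array = ak.from_iter({data!r})
    ak.sum(array, axis={axis})
""")

def crashes(data, axis, tries=20):
    # The fault is intermittent, so try each case several times.
    code = SNIPPET.format(data=data, axis=axis)
    for _ in range(tries):
        proc = subprocess.run([sys.executable, "-c", code])
        if proc.returncode != 0:  # a segfault shows up as a nonzero (negative) code
            return True
    return False

# e.g. crashes([[[1, 2], [3]], [[4, 5], [6]]], axis=1)
```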
@alexander-held thanks for reporting it! @jpivarski - I think this is related to https://github.com/scikit-hep/awkward-1.0/issues/1266
I can't reproduce this locally on Linux with:

- `ak.__version__ == '1.7.0'` (PyPI) and Python `3.10.0` (Conda) and `np.__version__ == '1.21.5'` (PyPI)
- `ak.__version__ == '1.7.0'` (Conda) and Python `3.10.0` (Conda) and `np.__version__ == '1.22.2'` (Conda)
- `ak.__version__ == '1.7.0'` (Conda) and Python `3.9.10` (Conda) and `np.__version__ == '1.22.2'` (Conda) :star:

Where :star: matches your first specification.
The fact that it doesn't fail for you in the container, and doesn't fail for me in the same environment as your first failure (apart from the host), makes me wonder whether this is a macOS-specific failure.
@agoose77 - I think it's the same as in the case you've reported - the `parents` are not ordered:

```
<ListArray64>
    <starts><Index64 i="[0 15 30 5 20 35 10 25 40 3 18 33 8 23 38 13 28 43]" offset="0" length="18" at="0x7ff1f6023e30"/></starts>
    <stops><Index64 i="[3 18 33 8 23 38 13 28 43 5 20 35 10 25 40 15 30 45]" offset="0" length="18" at="0x7ff1f6023ee0"/></stops>
    <content><NumpyArray format="d" shape="45" data="1 4 9 16 25 ... 25 36 49 64 81" at="0x7ff1f600a630"/></content>
</ListArray64>
```

```
ListArrayOf::reduce_next negaxis = 2, starts = (0 ,9 ,3 ,12 ,6 ,15 ,), shifts = (), parents = (0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 5, 5, 5, )
```
but the algorithms calculating `gaps` and `distincts` assume that they are, so `nextstarts` and `gaps` may get corrupted:

```
nextstarts:
<Index64 i="[0 18 36 9 27 902837359883296 3 21 39 12 30 3828118095694672928 6 24 42 15 33]" offset="0" length="17" at="0x7ff1f6024270"/>
gaps:
<Index64 i="[1 2 2 1 3756056159110504499 3828118095694672928]" offset="0" length="6" at="0x7ff1f60245e0"/>
```
The memory allocated on Linux might already be initialised... but not on macOS.
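To make that concrete, here is a tiny check of my own (an illustration, not awkward's API), using the `parents` printed above:

```python
import numpy as np

# The kernels computing `gaps` and `distincts` expect `parents` to be
# non-decreasing; the dump above violates that.
parents = np.array([0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 5, 5, 5])
print(np.all(np.diff(parents) >= 0))  # False: unsorted, hence the corruption
```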
Sometimes a segfault is hard to catch, but if this is an error in the logic of reduction, it would be more obvious in `studies/reducers.py`. Set the initial lists to be full of `999`, and if there are any `999` left after the kernel, it's uninitialized (in a non-platform-dependent way). Also, out-of-bounds errors would be caught, because reducers.py is Python.
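A minimal sketch of that sentinel trick (the `run_checked` helper and its signature are hypothetical, not the actual `studies/reducers.py` code):

```python
import numpy as np

SENTINEL = 999

def run_checked(kernel, outlength, *args):
    # Poison the output buffer so slots the kernel never writes stand out.
    out = np.full(outlength, SENTINEL, dtype=np.float64)
    kernel(out, *args)  # the pure-Python kernel under test
    leftover = np.nonzero(out == SENTINEL)[0]
    if len(leftover) > 0:
        raise AssertionError(f"uninitialized output slots: {leftover.tolist()}")
    return out
```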
This is a 3D sample with variable lengths, but if it's related to the 4D sample without variable lengths, one or the other may be easier to diagnose. I wonder if this 3D sample could be modified to have fixed lengths and still be an issue?
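One rough way to try that (my sketch, not from the issue): build a fixed-length sample, so the layout uses `RegularArray` instead of variable-length lists, and run the same reduction:

```python
import awkward as ak
import numpy as np

# Every axis has a definite length, in contrast to the var-length sample.
fixed = ak.from_numpy(np.arange(24, dtype=np.float64).reshape(2, 3, 4),
                      regulararray=True)
ak.sum(fixed, axis=1)  # does the crash need variable lengths to occur?
```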
Also as a reminder, switching `sum` to `prod` and using all prime numbers makes it possible to trace back where each number came from: you can infer the `carry` index from the carried `content`. (With `sum`, you'd need powers of n, which get big faster than primes.)
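A quick illustration of the primes trick on a small (working) example:

```python
import awkward as ak

# Distinct primes make every product uniquely factorizable.
primes = ak.from_iter([[2, 3, 5], [7, 11]])
print(ak.prod(primes, axis=0).tolist())  # [14, 33, 5]
# 14 = 2*7 and 33 = 3*11 show exactly which elements were combined,
# and 5 had no partner; with sum, 2+7 = 9 would be ambiguous.
```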
I can reproduce the segfault with v1 (on master and with Jim's branch `jpivarski-ianna/fix-4D-reducers`):

```python
import awkward as ak
import numpy as np

v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v

# frequently (not always) segfaults with axis=0 or 1
np.sum(v_times_M_times_v, axis=0)
# <Array [[[3.7, 11.6, 23.7], ... 84.7], []]] type='3 * var * var * float64'>

array = ak.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])

# always(?) segfaults with axis=0 or 1, also when using np.sum
ak.sum(array, axis=1)
```

```
zsh: segmentation fault python
```
but not with v2:

```python
import awkward as ak
import numpy as np

v = ak._v2.operations.convert.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v
ak._v2.sum(v_times_M_times_v, axis=0)

array = ak._v2.operations.convert.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])
ak._v2.sum(array, axis=1)
# <Array [[[6.6], [18, 34.2]], [[21.6], [...]]] type='2 * var * var * float64'>
```
I've added a 4D test in `studies/reducers.py` that fails.
We've got 3.5 months before v2 comes out and not a lot of clamor about this issue. Unless somebody says, "This is affecting my workflow, please fix it!" I think we can just ignore it. The request-for-fix could come after the 2.0 release; we will be providing bug-fixes on demand (through a `main-v1` git branch).