
ak.sum segfault: backport v2 solution into v1

Open alexander-held opened this issue 3 years ago • 6 comments

Version of Awkward Array

1.7.0

Description and code to reproduce

import awkward as ak
import numpy as np

v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v

# frequently (not always) segfaults with axis=0 or 1
np.sum(v_times_M_times_v, axis=0)

array = ak.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])
# always(?) segfaults with axis=0 or 1, also when using np.sum
ak.sum(array, axis=1)

The code above frequently segfaults in the first np.sum call and always (in at least ~100 attempts) segfaults in the second ak.sum call (the same happens when replacing it with np.sum).

I can reproduce this with two setups:

  • Python 3.9.10 on M1 macOS with awkward 1.7.0 and numpy 1.22.2
  • Python 3.8.10 on Intel macOS with awkward 1.7.0 and numpy 1.22.1

I cannot reproduce the segmentation fault when using the following:

  • python:3.9-slim container with awkward 1.7.0 and numpy 1.22.2

In case it is useful, I rarely get the following output as well:

python(14372,0x102a64580) malloc: Heap corruption detected, free list is damaged at 0x600000358260
*** Incorrect guard value: 8
python(14372,0x102a64580) malloc: *** set a breakpoint in malloc_error_break to debug

In the original setup where I came across this (uses a larger array as input) I also see the following:

python(12981,0x105568580) malloc: *** error for object 0x3d: pointer being freed was not allocated
python(12981,0x105568580) malloc: *** set a breakpoint in malloc_error_break to debug

The original setup also crashes less frequently (though I did not measure this precisely).

I am not sure how to approach debugging this, but I am happy to help given some pointers. I have not yet tried other combinations of numpy and awkward, or systematically investigated which kinds of arrays work and which cause crashes.
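(A general debugging suggestion, not specific to Awkward Array: Python's standard-library faulthandler module can turn a silent segfault into at least a Python-level traceback, narrowing down which call triggers the faulting kernel.)

```python
import faulthandler

# Print a Python traceback for all threads if the process receives a
# fatal signal (SIGSEGV, SIGBUS, SIGFPE, ...), instead of dying silently.
faulthandler.enable()

# ...run the reproducer above here; on a crash, the dump shows which
# Python frame was active when the native kernel faulted.
```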

alexander-held avatar Feb 10 '22 13:02 alexander-held

@alexander-held thanks for reporting it! @jpivarski - I think this is related to https://github.com/scikit-hep/awkward-1.0/issues/1266

ianna avatar Feb 10 '22 13:02 ianna

I can't reproduce this locally on Linux with:

  • ak.__version__ == '1.7.0' (PyPI) and Python 3.10.0 (Conda) and np.__version__ == '1.21.5' (PyPI)
  • ak.__version__ == '1.7.0' (Conda) and Python 3.10.0 (Conda) and np.__version__ == '1.22.2' (Conda)
  • ak.__version__ == '1.7.0' (Conda) and Python 3.9.10 (Conda) and np.__version__ == '1.22.2' (Conda) :star:

Where :star: matches your first specification.

The fact that it doesn't fail for you in the container, and doesn't fail for me in the same environment (apart from the host) as your first failing setup, makes me wonder whether this is a macOS-specific failure.

agoose77 avatar Feb 10 '22 13:02 agoose77

@agoose77 - I think it's the same as the case you reported: the parents are not ordered:

<ListArray64>
    <starts><Index64 i="[0 15 30 5 20 35 10 25 40 3 18 33 8 23 38 13 28 43]" offset="0" length="18" at="0x7ff1f6023e30"/></starts>
    <stops><Index64 i="[3 18 33 8 23 38 13 28 43 5 20 35 10 25 40 15 30 45]" offset="0" length="18" at="0x7ff1f6023ee0"/></stops>
    <content><NumpyArray format="d" shape="45" data="1 4 9 16 25 ... 25 36 49 64 81" at="0x7ff1f600a630"/></content>
</ListArray64>
ListArrayOf::reduce_next negaxis = 2, starts = (0 ,9 ,3 ,12 ,6 ,15 ,), shifts = (), parents = (0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 5, 5, 5, )

but the algorithms calculating gaps and distincts assume they are ordered, so nextstarts and gaps can get corrupted:

nextstarts:
<Index64 i="[0 18 36 9 27 902837359883296 3 21 39 12 30 3828118095694672928 6 24 42 15 33]" offset="0" length="17" at="0x7ff1f6024270"/>
gaps:
<Index64 i="[1 2 2 1 3756056159110504499 3828118095694672928]" offset="0" length="6" at="0x7ff1f60245e0"/>

The memory allocated on Linux might already be initialized... but not on macOS.
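A minimal sketch (plain NumPy, not the actual C++ kernel) of why a sorted-parents assumption misassigns group boundaries: a pass that treats the k-th boundary it encounters as belonging to group k fills every output slot, but with the unsorted parents above, the values land in the wrong groups.

```python
import numpy as np

# Parents as printed in the debug output above: contiguous groups,
# but the group labels are not in ascending order.
parents = np.array([0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 1, 3, 3, 3, 5, 5, 5])
nparents = 6

# A toy "start of each group" pass written with the sorted-parents
# assumption: the k-th boundary encountered is assigned to group k.
assumed = np.full(nparents, 999)  # sentinel standing in for uninitialized memory
k = 0
for i in range(len(parents)):
    if i == 0 or parents[i] != parents[i - 1]:
        assumed[k] = i
        k += 1

# The correct mapping indexes by the actual parent label instead.
correct = np.full(nparents, -1)
for i, p in enumerate(parents):
    if correct[p] == -1:
        correct[p] = i

print(assumed)  # [ 0  3  6  9 12 15] -- boundaries assigned to the wrong groups
print(correct)  # [ 0  9  3 12  6 15]
```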

ianna avatar Feb 10 '22 14:02 ianna

Sometimes segfaulting is hard to catch, but if this is an error in the logic of reduction, then it would be more obvious in studies/reducers.py. Set the initial lists to be full of 999 and if there are any 999 left after the kernel, it's uninitialized (in a non-platform-dependent way). Also, out-of-bounds errors would be caught because reducers.py is Python.
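That sentinel idea could be wrapped in a small helper (hypothetical, not an existing function in studies/reducers.py): pre-fill the output with 999, run the kernel, and flag any slot the kernel never wrote.

```python
import numpy as np

def check_kernel_fills_output(kernel, out_length, *args):
    # Hypothetical helper: run a reducer-style kernel into a buffer
    # pre-filled with the sentinel 999 and report any slot it never wrote.
    out = np.full(out_length, 999, dtype=np.int64)
    kernel(out, *args)
    leftover = np.nonzero(out == 999)[0]
    if leftover.size > 0:
        raise AssertionError(f"uninitialized output slots: {leftover.tolist()}")
    return out

# A kernel with an off-by-one bug: the last slot is never written.
def buggy_fill(out, n):
    for i in range(n - 1):
        out[i] = i

try:
    check_kernel_fills_output(buggy_fill, 4, 4)
except AssertionError as err:
    print(err)  # uninitialized output slots: [3]
```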

This is a 3D sample with variable lengths, but if it's related to the 4D sample without variable lengths, one or the other may be easier to diagnose. I wonder if this 3D sample could be modified to have fixed lengths and still be an issue?

Also as a reminder, switching sum with prod and using all prime numbers makes it possible to trace back where each number came from—i.e. you can infer the carry index from the carried content. (With sum, you'd need powers of n, which get big faster than primes.)
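A small illustration of the prime-product trick (plain NumPy, with a made-up parents mapping): give every input a distinct prime, reduce with prod instead of sum, and factor each output to recover exactly which inputs were carried into it.

```python
import numpy as np

primes = np.array([2, 3, 5, 7, 11, 13])
# Made-up parents mapping: which output group each input belongs to.
parents = np.array([0, 1, 0, 1, 0, 1])

# Reduce with prod instead of sum.
out = np.ones(2, dtype=np.int64)
for p, x in zip(parents, primes):
    out[p] *= x

def factor(n, primes):
    # Because every input is a distinct prime, the factorization of each
    # product identifies exactly which inputs were combined into it.
    return [int(p) for p in primes if n % p == 0]

print(out)                     # [110 273]
print(factor(out[0], primes))  # [2, 5, 11]
print(factor(out[1], primes))  # [3, 7, 13]
```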

jpivarski avatar Feb 10 '22 14:02 jpivarski

I can reproduce the segfault with v1 (in master and with Jim's branch jpivarski-ianna/fix-4D-reducers):

>>> import awkward as ak
>>> import numpy as np

>>> v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
>>> M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
>>> M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
>>> v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v

>>> # frequently (not always) segfaults with axis=0 or 1
>>> np.sum(v_times_M_times_v, axis=0)
<Array [[[3.7, 11.6, 23.7], ... 84.7], []]] type='3 * var * var * float64'>
>>> array = ak.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])
>>> # always(?) segfaults with axis=0 or 1, also when using np.sum
>>> ak.sum(array, axis=1)
zsh: segmentation fault  python

but not with v2:

>>> import awkward as ak
>>> import numpy as np

>>> v = ak._v2.operations.convert.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
>>> M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])
>>> M_times_v = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
>>> v_times_M_times_v = v[:, np.newaxis, ...] * M_times_v

>>> ak._v2.sum(v_times_M_times_v, axis=0)
>>> array = ak._v2.operations.convert.from_iter([[[[1], [4, 9]], [[5.6], [14, 25.2]]], [[[5.6], [14, 25.2]], [[16], [25, 36]]]])
>>> ak._v2.sum(array, axis=1)
<Array [[[6.6], [18, 34.2]], [[21.6], [...]]] type='2 * var * var * float64'>

ianna avatar Feb 10 '22 14:02 ianna

> Sometimes segfaulting is hard to catch, but if this is an error in the logic of reduction, then it would be more obvious in studies/reducers.py. Set the initial lists to be full of 999 and if there are any 999 left after the kernel, it's uninitialized (in a non-platform-dependent way). Also, out-of-bounds errors would be caught because reducers.py is Python.
>
> This is a 3D sample with variable lengths, but if it's related to the 4D sample without variable lengths, one or the other may be easier to diagnose. I wonder if this 3D sample could be modified to have fixed lengths and still be an issue?
>
> Also as a reminder, switching sum with prod and using all prime numbers makes it possible to trace back where each number came from—i.e. you can infer the carry index from the carried content. (With sum, you'd need powers of n, which get big faster than primes.)

I've added a 4D test in the studies/reducers.py that fails.

ianna avatar Feb 10 '22 14:02 ianna

We've got 3.5 months before v2 comes out and not a lot of clamor about this issue. Unless somebody says, "This is affecting my workflow, please fix it!" I think we can just ignore it. That request for a fix could come after the 2.0 release; we will be providing bug fixes on demand (through a main-v1 git branch).

jpivarski avatar Aug 15 '22 18:08 jpivarski