dask-awkward Ungraceful failing when data isn't properly touched

I'm attempting to implement a function over an awkward collection and expose it to dask_awkward. I've noticed that if the correct touching is not implemented (see the arr.y.layout._touch_data line), the following program segfaults.

import awkward
import dask_awkward
import numpy


def ak_f(arr):
    return arr.x + arr.y


def dak_f(arr):
    class _call_wrap:
        def __call__(self, arr):
            if awkward.backend(arr) == "typetracer":
                # Running the touch function
                arr.x.layout._touch_data(recursive=True)
                # arr.y.layout._touch_data(recursive=True)  # Intentionally disabled for demonstration

                # Getting the length-0 array for evaluation
                x = arr.layout.form.length_zero_array(behavior=arr.behavior)
                out = ak_f(x)
                return awkward.Array(
                    out.layout.to_typetracer(forget_length=True),
                    behavior=out.behavior,
                )
            else:
                return ak_f(arr)

    return dask_awkward.lib.core.map_partitions(_call_wrap(), arr, label="test_dak")


ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)

print(ak_f(ak_arr))

awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(dak_f(dak_arr).compute())

For the shorter term, I think it is important that these failures happen more gracefully, with a user-visible error message rather than a hard segfault.

For the longer term, it would also be nice if the touching of collection inputs could be handled automatically.
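To make both requests concrete, here is a toy sketch in plain Python. It does not use awkward or dask_awkward internals, and every name in it (ToyTracer, ProjectedRecord, UntouchedColumnError, run_with_auto_touch) is hypothetical. It only illustrates the behavior being asked for: a dry run records which columns a function reads, and reading a column that was never loaded raises a readable error instead of crashing.

```python
class UntouchedColumnError(RuntimeError):
    """Raised when a column is read that was never marked as needed."""


class ToyTracer:
    """Hypothetical stand-in for a typetracer: records which columns are read."""

    def __init__(self, columns):
        self._columns = set(columns)
        self.touched = set()

    def __getitem__(self, name):
        if name not in self._columns:
            raise KeyError(name)
        self.touched.add(name)
        return 0.0  # placeholder value; only the access is of interest


class ProjectedRecord:
    """Holds only the loaded columns and fails loudly on any other."""

    def __init__(self, data, loaded):
        self._data = {name: data[name] for name in loaded}

    def __getitem__(self, name):
        if name not in self._data:
            raise UntouchedColumnError(
                f"column {name!r} was not loaded; was it touched during tracing?"
            )
        return self._data[name]


def run_with_auto_touch(func, data):
    """Dry-run func on a tracer, then run it on only the columns it read."""
    tracer = ToyTracer(data)
    func(tracer)  # records the columns that func accesses
    return func(ProjectedRecord(data, tracer.touched))


def add_xy(rec):
    return rec["x"] + rec["y"]


# Automatic touching: both "x" and "y" are recorded, "z" is never loaded.
result = run_with_auto_touch(add_xy, {"x": 1.0, "y": 2.0, "z": 9.0})

# Graceful failure: only "x" was loaded, so reading "y" raises a clear error.
try:
    add_xy(ProjectedRecord({"x": 1.0, "y": 2.0}, {"x"}))
except UntouchedColumnError as err:
    message = str(err)
```

The real case is harder because the function goes through a length-zero array rather than the tracer itself, so the dry run cannot observe the accesses directly. That is why the touching has to be done by hand in the reproducer above.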

Apr 26 '23 13:04 yimuchen

Just to comment on this - the functions being called in the real version of this example are ML frameworks where we are forced to go through length_zero_arrays in order to pass through the ML framework and then reassemble the output.

I just want to make sure that no one gets hung up on the simplicity of the function being called in the repro (which could be handled by typetracing automatically).

Apr 28 '23 17:04 lgray

I can confirm that I see this issue on the current main branch of dask-awkward.

Apr 28 '23 19:04 martindurant

Is this still an issue? Things have evolved a lot w.r.t. typetracer arrays and data touching.

Oct 24 '23 13:10 douglasdavis

This is still an issue for dask_awkward==2023.10.1 and awkward==2.4.6.

Oct 24 '23 14:10 yimuchen

I think this needs to be recast in some of the more modern idioms (length_zero_if_typetracer and such). That should hide some of the more touchy bits of the interface and ensure consistency.

You can now achieve the same thing via the code below, with no need for a wrapper:

import awkward
import dask_awkward
import numpy

def ak_f(arr):
    if isinstance(arr, dask_awkward.Array):
        return arr.x + arr.y
    return awkward.typetracer.length_zero_if_typetracer(arr.x) + awkward.typetracer.length_zero_if_typetracer(arr.y)

ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)

print(ak_f(ak_arr))

awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(ak_f(dak_arr).compute())

Oct 24 '23 14:10 lgray

I think the segfault when data for an array isn't available is a lower level awkward issue. We should handle it more gracefully there?

Oct 24 '23 14:10 lgray

Can we close this?

Mar 06 '24 15:03 agoose77
