dask-awkward
Ungraceful failing when data isn't properly touched
I'm attempting to implement a function over an awkward collection and expose it to dask_awkward. I've noticed that if the correct touching is not implemented (see the arr.y.layout._touch_data line), then the following program segfaults.
import awkward
import dask_awkward
import numpy

def ak_f(arr):
    return arr.x + arr.y

def dak_f(arr):
    class _call_wrap:
        def __call__(self, arr):
            if awkward.backend(arr) == "typetracer":
                # Running the touch function
                arr.x.layout._touch_data(recursive=True)
                # arr.y.layout._touch_data(recursive=True)  ## Intentionally disabled for demonstration
                # Getting the length-0 array for evaluation
                x = arr.layout.form.length_zero_array(behavior=arr.behavior)
                out = ak_f(x)
                return awkward.Array(
                    out.layout.to_typetracer(forget_length=True),
                    behavior=out.behavior,
                )
            else:
                return ak_f(arr)

    return dask_awkward.lib.core.map_partitions(_call_wrap(), arr, label="test_dak")

ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)
print(ak_f(ak_arr))
awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(dak_f(dak_arr).compute())
For the shorter term, I think it is important that these failures happen more gracefully, with a user-visible error message rather than a hard segfault.
For the longer term, it would also be nice if touching of collection inputs can be handled automatically.
Just to comment on this - the functions being called in the real version of this example are ML frameworks where we are forced to go through length_zero_arrays in order to pass through the ML framework and then reassemble the output.
Simply making sure that no one gets hung up on the simplicity of the function being called in the repro (which could be handled by typetracing automatically).
I can confirm that I see this issue on the current main branch of dask-awkward.
Is this still an issue? Things have evolved a lot w.r.t. typetracer arrays and data touching.
This is still an issue for dask_awkward==2023.10.1 and awkward==2.4.6.
I think this needs to be recast in some of the more modern idioms (length_zero_if_typetracer and such). That should hide some of the more touchy bits of the interface and ensure consistency.
You can now achieve the same thing with the code below, with no need for a wrapper:
import awkward
import dask_awkward
import numpy

def ak_f(arr):
    if isinstance(arr, dask_awkward.Array):
        return arr.x + arr.y
    # Eager arrays pass through unchanged; typetracers are touched and
    # replaced by length-zero arrays.
    return awkward.typetracer.length_zero_if_typetracer(arr.x) + awkward.typetracer.length_zero_if_typetracer(arr.y)

ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)
print(ak_f(ak_arr))
awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(ak_f(dak_arr).compute())
I think the segfault when data for an array isn't available is a lower level awkward issue. We should handle it more gracefully there?
Can we close this?