PlaceholderArray encountered in `BitMaskedArray.to_ByteMaskedArray` when it shouldn't be
Here's a reproducer:
```python
import awkward as ak

ak.concatenate([
    ak.from_parquet("one.parquet")["goodjets"],
    ak.from_parquet("two.parquet")["goodjets"],
])
```
succeeds but
```python
import awkward as ak
import dask_awkward as dak

ak.concatenate([
    dak.from_parquet("one.parquet")["goodjets"],
    dak.from_parquet("two.parquet")["goodjets"],
]).compute()
```
fails with
```
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 376, in compute
(result,) = compute(self, traverse=False, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 664, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask/base.py", line 664, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
^^^^^^^^
File "/home/jpivarski/miniforge3/lib/python3.11/site-packages/dask_awkward/lib/core.py", line 830, in _finalize_array
return ak.concatenate(results)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
next(gen_or_result)
File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 64, in concatenate
return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in _impl
contents = [ak._do.mergemany(b) for b in batches]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_concatenate.py", line 160, in <listcomp>
contents = [ak._do.mergemany(b) for b in batches]
^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/_do.py", line 218, in mergemany
return contents[0]._mergemany(contents[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/listoffsetarray.py", line 808, in _mergemany
out = listarray._mergemany(others)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/listarray.py", line 1128, in _mergemany
nextcontent = contents[0]._mergemany(tail_contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 723, in _mergemany
trimmed = field[0 : array.length]
~~~~~^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
return self._getitem(where)
^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 523, in _getitem
return self._getitem_range(start, stop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/bitmaskedarray.py", line 493, in _getitem_range
return self.to_ByteMaskedArray()._getitem_range(start, stop)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/contents/bitmaskedarray.py", line 384, in to_ByteMaskedArray
self._backend[
File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 91, in __call__
return self._impl(
^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 92, in <genexpr>
*(self._cast(x, t) for x, t in zip(args, self._impl.argtypes))
^^^^^^^^^^^^^^^^
File "/home/jpivarski/irishep/awkward/src/awkward/_kernels.py", line 82, in _cast
raise AssertionError(
AssertionError: Only NumPy buffers should be passed to Numpy Kernels, received PyCPointerType
```
Going into more detail, the troublemaker is `self._mask.data`, which is a PlaceholderArray. The rehydration must be saying that this buffer is not needed, but it is needed: the concatenation needs to know which array elements are missing.
Another indicator: passing `optimize_graph=False` to `compute` makes it work. It's the column optimization.
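For reference, the failing reproducer goes through as soon as graph optimization (and with it the column projection) is skipped:

```python
import awkward as ak
import dask_awkward as dak

# same as the failing example above, but with dask's graph optimization
# disabled, so the parquet layers load every column instead of a projected subset
ak.concatenate([
    dak.from_parquet("one.parquet")["goodjets"],
    dak.from_parquet("two.parquet")["goodjets"],
]).compute(optimize_graph=False)
```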
Is it expected to be possible to do `concatenate` on typetracers? It should be needed, since we need to know the columns to select from both input layers independently; we have no mechanism to carry the columns found to be needed for one layer across to another.
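For concreteness, a minimal sketch of what "concatenate on typetracers" means, using `ak.to_backend` on toy arrays rather than the parquet data above; if this is supported, the result is another typetracer with a known type and an unknown length:

```python
import awkward as ak

# toy arrays moved to the typetracer backend: types only, no data buffers
a = ak.to_backend(ak.Array([{"x": 1.1, "y": 2.2}]), "typetracer")
b = ak.to_backend(ak.Array([{"x": 3.3, "y": 4.4}]), "typetracer")

# if concatenation of typetracers works, this is a typetracer result whose
# type is known even though its length is not
c = ak.concatenate([a, b])
print(c.type)
```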
So far I have found these edges:
- in `concatenate`, we were always going down the "enforce form" route even if the inputs are the same, which is not really a problem, but is easy to fix
- the concatenated dak object has two partitions, so effectively a list of typetracers rather than a combined typetracer. (I think) we only ever touch the first; this may cause the following problem:
- the output has read data in partition 0 and a typetracer in partition 1, causing the exception.
So indeed, the second partition is receiving `columns=[]` (nothing to load), and `unproject_layout` is turning all of the missing columns into typetracers.
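This can also be seen from the user side with dask-awkward's column-report helper (a diagnostic sketch; the helper is `necessary_columns` in the dask-awkward versions from this era and `report_necessary_columns` in newer ones):

```python
import awkward as ak
import dask_awkward as dak

left = dak.from_parquet("one.parquet")["goodjets"]
right = dak.from_parquet("two.parquet")["goodjets"]
result = ak.concatenate([left, right])

# reports, per input layer, which columns the optimization plans to read;
# per the analysis above, the second parquet layer comes back empty
print(dak.necessary_columns(result))
```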
I can confirm that the one-pass branch successfully computes the second failing case, but only the first time. Subsequent computes fail. The failure mode is that no required columns get passed to parquet at all. Calling `dak.core.dak_cache.clear()` causes it to pass again, so we have a good hint of where the problem is.
Given that #526 has a variant of this same problem, is it time to dust off the one-pass PR?
(I should say that a trivial, but not great, workaround for the issue here is to touch all inputs to a concatenate, which is somehow what the other linked issue ended up doing, presumably because of axis=1.)
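A rough sketch of that "touch everything" workaround; `touch_all` is a hypothetical helper, it assumes `ak.typetracer.touch_data` and `map_partitions`, and it deliberately defeats the column optimization:

```python
import awkward as ak
import dask_awkward as dak

def touch_all(array):
    # during the typetracer pass this marks every buffer as needed, so the
    # column optimization cannot drop any of them; on concrete data it is a no-op
    ak.typetracer.touch_data(array)
    return array

left = dak.from_parquet("one.parquet")["goodjets"].map_partitions(touch_all)
right = dak.from_parquet("two.parquet")["goodjets"].map_partitions(touch_all)
ak.concatenate([left, right]).compute()
```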
Hi @jpivarski / @martindurant, just to revive this:
I've run into the PlaceholderArray/PyCPointerType error once again, though this time `concatenate` doesn't seem to be to blame. Basically, I have somehow created a set of parquet files that really trigger this placeholder trouble. If I try to do any manipulation on them (say, divide), I get the following error:
```
AssertionError: Only NumPy buffers should be passed to Numpy Kernels, received PyCPointerType
```
Tracing this back a little more, even just reading the arrays from a parquet file seems problematic. To be specific, if I do
```python
file = ak.from_parquet(path)
print(file['ungroomed_ecfs'])
```
everything is fine. But when reading in from delayed, I get errors.
Doing
```python
file = dak.from_parquet(path)
print(file.compute()['ungroomed_ecfs'])
```
fails with the error `TypeError: PlaceholderArray supports only trivial slices, not int`.
Do these perhaps stem from the same place? I had never noticed this before because I never had a reason to read these files with dask-awkward/delayed. I can provide example parquet files if it would be helpful. I'll attach the extended errors: pyc_error.txt, TypeError.txt
Ah so this means that the parquet dak-array doesn't know to materialize all the data.
Can you try file["ungroomed_ecfs"].compute() instead?
Quick question: you mention "delayed" here, but I don't see any calls to delayed - you mean lazy operations, or did I miss something?
Simply failing to compute data loaded from parquet without any further operations would be bad. Can you share the file?
Indeed @lgray, it may be that the single-layer case is an edge case we hadn't considered.
Sorry, looks like I got my terminology mixed up. I did mean lazy: reading the parquet files lazily with `dak.from_parquet(*,*,compute=False)`.
@lgray I tried reversing the order as you suggested and while the error is gone, it just produces an empty array:
```
>>> file["ungroomed_ecfs"].compute()
-------------------------
type: 0 * {
"1e2^0.5": ?float64,
"1e2^1.0": ?float64,
"1e2^1.5": ?float64,
"1e2^2.0": ?float64,
"1e2^2.5": ?float64,
"1e2^3.0": ?float64,
"1e2^3.5": ?float64,
.
.
.
```
This is odd, definitely incorrect, and not what happens when I read eagerly.
I've uploaded the example parquet file to google drive, it should be accessible to anyone with the link: https://drive.google.com/drive/folders/1548z0m6IYuIKUA1EzfgG3_OylpPK8T3U?usp=sharing
That's certainly weird. We'll try to get back to you soon on it.
In the meantime can you use dask-dataframe or dask-array, if it's just a bunch of flat lists of floats?
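For reference, that stopgap would look roughly like this ("ecfs.parquet" stands in for the shared file, and it only makes sense while the columns are flat floats):

```python
import dask.dataframe as dd

# read the same parquet file with dask-dataframe instead of dask-awkward
df = dd.read_parquet("ecfs.parquet")
print(df.compute())
```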
Yeah, I should be able to convert for now. Though it's maybe also helpful to point out that Jim mentioned to me on Mattermost a few months ago (and above) that passing `optimize_graph=False` is a workaround, and that actually is still true here. If I do
```
>>> file["ungroomed_ecfs"].compute(optimize_graph=False)
[{'1e2^0.5': 0.15, '1e2^1.0': 0.0632, '1e2^1.5': 0.0305, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.199, '1e2^1.0': 0.116, '1e2^1.5': 0.0754, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.24, '1e2^1.0': 0.143, '1e2^1.5': 0.0939, '1e2^2.0': 0.0656, ...},
{'1e2^0.5': 0.221, '1e2^1.0': 0.135, '1e2^1.5': 0.0912, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.26, '1e2^1.0': 0.166, '1e2^1.5': 0.114, '1e2^2.0': 0.0822, ...},
{'1e2^0.5': 0.223, '1e2^1.0': 0.136, '1e2^1.5': 0.0924, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.267, '1e2^1.0': 0.169, '1e2^1.5': 0.116, '1e2^2.0': 0.0833, ...},
{'1e2^0.5': 0.249, '1e2^1.0': 0.148, '1e2^1.5': 0.0957, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.265, '1e2^1.0': 0.174, '1e2^1.5': 0.124, '1e2^2.0': 0.0932, ...},
{'1e2^0.5': 0.186, '1e2^1.0': 0.0837, '1e2^1.5': 0.0404, '1e2^2.0': ..., ...},
...,
{'1e2^0.5': 0.212, '1e2^1.0': 0.112, '1e2^1.5': 0.0649, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.256, '1e2^1.0': 0.158, '1e2^1.5': 0.106, '1e2^2.0': 0.0755, ...},
{'1e2^0.5': 0.26, '1e2^1.0': 0.185, '1e2^1.5': 0.143, '1e2^2.0': 0.115, ...},
{'1e2^0.5': 0.229, '1e2^1.0': 0.152, '1e2^1.5': 0.116, '1e2^2.0': 0.0944, ...},
{'1e2^0.5': 0.201, '1e2^1.0': 0.123, '1e2^1.5': 0.09, '1e2^2.0': 0.0718, ...},
{'1e2^0.5': 0.17, '1e2^1.0': 0.107, '1e2^1.5': 0.0754, '1e2^2.0': 0.055, ...},
{'1e2^0.5': 0.202, '1e2^1.0': 0.102, '1e2^1.5': 0.0562, '1e2^2.0': ..., ...},
{'1e2^0.5': 0.256, '1e2^1.0': 0.159, '1e2^1.5': 0.107, '1e2^2.0': 0.0751, ...},
{'1e2^0.5': 0.234, '1e2^1.0': 0.148, '1e2^1.5': 0.104, '1e2^2.0': 0.0763, ...}]
--------------------------------------------------------------------------------
type: 13050 * {
"1e2^0.5": ?float64,
"1e2^1.0": ?float64,
"1e2^1.5": ?float64,
"1e2^2.0": ?float64,
"1e2^2.5": ?float64,
.
.
.
```
Works as intended. I forgot he mentioned that until I went digging for it.
So the problem appears to be the keys/record labels. I went through and changed all the keys (removing the caret and the period) and now the problem is gone:
```
>>> file.compute()['ungroomed_ecfs']
{'1e205': 0.256, '1e210': 0.159, '1e215': 0.107, '1e220': 0.0751, ...},
{'1e205': 0.234, '1e210': 0.148, '1e215': 0.104, '1e220': 0.0763, ...}]
--------------------------------------------------------------------------
type: 13050 * {
"1e205": float64,
.
.
.
```
and
```
>>> file["ungroomed_ecfs"].compute()
{'1e205': 0.256, '1e210': 0.159, '1e215': 0.107, '1e220': 0.0751, ...},
{'1e205': 0.234, '1e210': 0.148, '1e215': 0.104, '1e220': 0.0763, ...}]
--------------------------------------------------------------------------
type: 13050 * {
"1e205": float64,
.
.
.
```
both now work as intended, without having to pass `optimize_graph=False`.
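For anyone else hitting this, a sketch of that renaming step (the file names are placeholders; stripping "^" and "." is just the choice used above):

```python
import re
import awkward as ak

array = ak.from_parquet("ecfs.parquet")  # placeholder for the shared file

# rebuild the nested record with sanitized field names, e.g. "1e2^0.5" -> "1e205"
ecfs = array["ungroomed_ecfs"]
renamed = ak.zip(
    {re.sub(r"[.^]", "", name): ecfs[name] for name in ak.fields(ecfs)},
    depth_limit=1,  # keep this a flat record; don't broadcast any deeper
)

# substitute the renamed record back in and write a new file
ak.to_parquet(ak.with_field(array, renamed, where="ungroomed_ecfs"), "ecfs_renamed.parquet")
```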
@pfackeldey, so this sounds like a mapper issue already in the v1 ("two-pass") column optimization. Maybe it works with v2? Field names containing "." are allowed in the parquet spec, although they are unusual, and other frameworks also use "field.subfield" as a shorthand for ["field", "subfield"]. At some point, @jpivarski suggested using syntax like "field.subfield.with.dots" for the unusual but general case. To me, this feels unwieldy, but I have no better suggestion except the list.
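To illustrate the ambiguity with a toy example (not this file's schema): in Awkward itself a nested field is addressed with a tuple of strings, while a single string is always a literal field name, dots and all; a dotted shorthand only exists where column names are dot-joined strings (as in parquet column paths), which is exactly where a literal "." becomes ambiguous:

```python
import awkward as ak

array = ak.Array([{"1e2^0.5": 1.0, "outer": {"inner": 2.0}}])

print(array["1e2^0.5"])         # the field literally named "1e2^0.5"
print(array["outer", "inner"])  # nested access, spelled as a tuple of strings
```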