GH-40407: [JS] Fix string coercion in MapRowProxyHandler.ownKeys
Rationale for this change

The `ProxyHandler.ownKeys` implementation must return strings or symbols. Because of this bug, it was returning numbers, causing the `in` operator to crash when trying to iterate over the keys of a `MapRow` object.

An example of this is a DuckDB SQL query using the `HISTOGRAM` operator:

`SELECT HISTOGRAM(phot_g_mean_mag) FROM gaia`
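The invariant behind this crash can be reproduced outside Arrow with a plain `Proxy` (a minimal sketch, not Arrow's actual `MapRowProxyHandler`): per the ECMAScript spec, the `ownKeys` trap may only return strings and symbols, and a number in the result throws a `TypeError` during key enumeration.

```javascript
// Illustrative only: a proxy whose ownKeys trap returns numbers, mimicking
// the bug, versus one that returns strings, mimicking the fix.
const badProxy = new Proxy({}, {
  // Numbers violate the [[OwnPropertyKeys]] invariant (strings/symbols only).
  ownKeys: () => [0, 1],
});

let crashed = false;
try {
  // for...in enumeration and Object.keys hit this same internal check.
  Reflect.ownKeys(badProxy);
} catch (e) {
  crashed = e instanceof TypeError;
}

const goodProxy = new Proxy({}, {
  ownKeys: () => ['0', '1'], // strings: allowed
});
const goodKeys = Reflect.ownKeys(goodProxy);
```

Here `crashed` ends up `true` and `goodKeys` is `['0', '1']`.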
What changes are included in this PR?

Instead of calling `array.map(String)`, which returns a typed array of non-strings when `array` is a typed array, call `Array.from`, which is guaranteed to return strings.
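The difference can be shown in a standalone sketch (illustrative only; `keys` here stands in for a row's key array, not Arrow's actual code): `TypedArray.prototype.map` coerces the callback's return value back to the typed array's element type, whereas `Array.from` with a map function builds a plain array.

```javascript
// A typed array standing in for a MapRow's keys.
const keys = new Int32Array([10, 20, 30]);

// TypedArray.prototype.map allocates another Int32Array, so the strings
// produced by String are coerced right back to numbers.
const viaMap = keys.map(String);

// Array.from builds a plain Array, so the mapped strings survive.
const viaFrom = Array.from(keys, String);

const mapElementType = typeof viaMap[0];   // 'number'
const fromElementType = typeof viaFrom[0]; // 'string'
```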
Are these changes tested?

Apologies, but I don’t know how to test this.

Are there any user-facing changes?

This fixes a crash when using the `in` operator on a `MapRow` object.
- GitHub Issue: #40407
:warning: GitHub issue #40407 has been automatically assigned in GitHub to PR creator.
LGTM, thanks @mbostock!
Woot, thanks for the approval @trxcllnt. I’m excited to contribute. 😁
IIRC the Map keys are required to be strings (tho I'm not sure we prevent someone from using a different type). Or has newer Arrow loosened this requirement? That's the only reason I can think we'd have made the assumption the keys would always be strings here.
I personally prefer supporting maps with keys of any Arrow dtype, but not sure how compatible with other implementations that will be. I do see the integration tests seem to only test maps with string keys.
Python (which uses C++ which could be considered the canonical implementation) only allows string keys.
Traceback (most recent call last):
File "/Users/dominik/Code/ramsch/map.py", line 8, in <module>
table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [{x: x} for x in range(10)]})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 5204, in pyarrow.lib.table
File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected dict key of type str or bytes, got 'int'
Rust doesn't seem to enforce string keys and we couldn't find anything in the spec so it seems to be a gray area right now.
Yeah, I'm inclined to tell DuckDB their Map implementation is non-conformant. Generally whatever `libarrow`/`pyarrow` does (and what's in the integration tests) is the source of truth.
Even if JS allows non-string keys, other implementations probably won't. This has been a significant source of confusion in the past, so we try to align with C++ as much as possible.
It is absolutely not the case that maps must have string keys. As just one example, part of the definition of ADBC metadata includes a `map<int32, list<int32>>`.
You're right, I don't see C++ checking the key type in `MapArray::ValidateChildData()`. Maybe string keys are just a thing pyarrow enforces. Either way, it'd be good to get some integration tests to settle any disagreements.
This is not how you would create a Map array with PyArrow. Example:

>>> ty = pa.map_(pa.int8(), pa.int32())
>>> a = pa.array([{1: 1000, 2: 10000}, {3: -1000}], type=ty)
>>> a
<pyarrow.lib.MapArray object at 0x7f7ab948dd80>
[
  keys:
  [
    1,
    2
  ]
  values:
  [
    1000,
    10000
  ],
  keys:
  [
    3
  ]
  values:
  [
    -1000
  ]
]
Oh, `pa.table` doesn't recognize the dictionary. Thanks for the corrected code!
Anything else I can do to help this land? Seems like we still want this fix.
Apologies for bumping, but I’d love to help move this forward, as this is currently preventing us from using DuckDB’s `HISTOGRAM` function (e.g. `SELECT HISTOGRAM(phot_g_mean_mag)` crashes per https://github.com/observablehq/framework/issues/1014). Please let me know if there is anything I can do.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 117460b12f5a9f99f961f634e5a2ea85ad445992.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.