arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-40407: [JS] Fix string coercion in MapRowProxyHandler.ownKeys

Open mbostock opened this issue 11 months ago • 11 comments

Rationale for this change

The ProxyHandler.ownKeys implementation must return strings or symbols. Because of this bug, it was returning numbers, causing the in operator to crash when trying to iterate over the keys of a MapRow object.

An example of this is a DuckDB SQL query using the HISTOGRAM operator:

SELECT HISTOGRAM(phot_g_mean_mag) FROM gaia

What changes are included in this PR?

Instead of calling array.map(String), which returns a typed array of non-strings when array is a typed array, call Array.from which is guaranteed to return strings.

Are these changes tested?

Apologies, but I don’t know how to test this.

Are there any user-facing changes?

This fixes a crash when using the in operator on a MapRow object.

  • GitHub Issue: #40407

mbostock avatar Mar 07 '24 19:03 mbostock

:warning: GitHub issue #40407 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Mar 07 '24 19:03 github-actions[bot]

LGTM, thanks @mbostock!

trxcllnt avatar Mar 07 '24 19:03 trxcllnt

Woot, thanks for the approval @trxcllnt. I’m excited to contribute. 😁

mbostock avatar Mar 07 '24 19:03 mbostock

IIRC the Map keys are required to be strings (tho I'm not sure we prevent someone from using a different type). Or has newer Arrow has loosened this requirement? That's the only reason I can think we'd have made the assumption the keys would always be strings here.

trxcllnt avatar Mar 07 '24 19:03 trxcllnt

I personally prefer supporting maps with keys of any Arrow dtype, but not sure how compatible with other implementations that will be. I do see the integration tests seem to only test maps with string keys.

trxcllnt avatar Mar 07 '24 19:03 trxcllnt

Python (which uses C++ which could be considered the canonical implementation) only allows string keys.

Traceback (most recent call last):
  File "/Users/dominik/Code/ramsch/map.py", line 8, in <module>
    table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [{x: x} for x in range(10)]})
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5204, in pyarrow.lib.table
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected dict key of type str or bytes, got 'int'

Rust doesn't seem to enforce string keys and we couldn't find anything in the spec so it seems to be a gray area right now.

domoritz avatar Mar 07 '24 20:03 domoritz

Yeah, I'm inclined to tell DuckDB their Map implementation is non-conformant. Generally whatever libarrow/pyarrow does (and what's in the integration tests) is the source of truth.

Even if JS allows non-string keys, other implementations probably won't. This has been a significant source of confusion in the past, so we try to align with C++ as much as possible.

trxcllnt avatar Mar 07 '24 21:03 trxcllnt

It is absolutely not the case that maps must have string keys. As just one example, part of the definition of ADBC metadata includes a map<int32, list<int32>>.

CurtHagenlocher avatar Mar 07 '24 21:03 CurtHagenlocher

You're right, I don't see C++ checking the keys type in MapArray::ValidateChildData(). Maybe the string keys is just a thing pyarrow enforces. Either way, it'd be good to get some integration tests to settle any disagreements.

trxcllnt avatar Mar 07 '24 21:03 trxcllnt

This is not how you would create a Map array with PyArrow. Example:

>>> ty = pa.map_(pa.int8(), pa.int32())
>>> a = pa.array([{1: 1000, 2: 10000}, {3: -1000}], type=ty)
>>> a
<pyarrow.lib.MapArray object at 0x7f7ab948dd80>
[
  keys:
  [
    1,
    2
  ]
  values:
  [
    1000,
    10000
  ],
  keys:
  [
    3
  ]
  values:
  [
    -1000
  ]
]

pitrou avatar Mar 07 '24 22:03 pitrou

Oh, pa.table doesn't recognize the dictionary. Thanks for the corrected code!

domoritz avatar Mar 08 '24 00:03 domoritz

Anything else I can do to help this land? Seems like we still want this fix.

mbostock avatar Mar 21 '24 20:03 mbostock

Apologies for bumping, but I’d love to help move this forward as this is currently preventing us from using DuckDB’s HISTOGRAM function (e.g. SELECT HISTOGRAM(phot_g_mean_mag) crashes per https://github.com/observablehq/framework/issues/1014). Please let me know if there is anything I can do.

mbostock avatar Apr 02 '24 16:04 mbostock

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 117460b12f5a9f99f961f634e5a2ea85ad445992.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.