snowflake-connector-python icon indicating copy to clipboard operation
snowflake-connector-python copied to clipboard

Receive ARRAY, VARIANT, OBJECT datatypes in Arrow format

Open sfc-gh-heshah opened this issue 3 years ago • 8 comments

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    SNOW-703039

  2. Fill out the following pre-review checklist:

    • [x] I am adding a new automated test(s) to verify correctness of my new code
    • [ ] I am adding new logging messages
    • [ ] I am adding a new telemetry message
    • [ ] I am modifying authorization mechanisms
    • [ ] I am adding new credentials
    • [ ] I am modifying OCSP code
    • [ ] I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Extending the ArrowIterator cpp files with a new VariantConverter object that relies on the previously created SnowflakeJSONDecoder class to handle undefined values.

sfc-gh-heshah avatar Dec 16 '22 18:12 sfc-gh-heshah

@sfc-gh-heshah Any chance this PR is still viable? It would be really great to properly support arrays (including undefined values).

cpcloud avatar Jan 04 '23 17:01 cpcloud

@sfc-gh-heshah Any idea if support for nested data with arrow will ever be added?

cpcloud avatar Mar 14 '23 12:03 cpcloud

VARIANT data can also be for example Avro – not just JSON.

The code presented here seems to targetting VARIANT data which is JSON-encoded; or is Snowflake implicitly converting other semi-structured formats to JSON? I wouldn't think so.

malthe avatar Mar 14 '23 12:03 malthe

The code presented here seems to targetting VARIANT data which is JSON-encoded; or is Snowflake implicitly converting other semi-structured formats to JSON? I wouldn't think so.

I'm not sure I follow. Snowflake encodes things in its internal format and doesn't preserve the source format.

cpcloud avatar Mar 14 '23 12:03 cpcloud

@cpcloud in this changeset, I read code such as:

def _VARIANT_to_python(self, _) -> Callable[[str], Any]:
    """Use a custom json decoder (accepts undefined) to load row value"""
    return partial(json.loads, cls=SnowflakeJSONDecoder)

Like, we see a VARIANT value, let's decode it using JSON. But is that the only "wire format" we can get? I'm thinking both in terms of performance but also fidelity – JSON doesn't have the same data type richness as the other supported semi-structured formats.

malthe avatar Mar 14 '23 13:03 malthe

The fact that the arrow interop code here is pulling out a string and then immediately decoding it as JSON suggests that variants are always coming back as JSON.

I'd love to be wrong about this, other wire formats would be most welcome from the ibis perspective.

cpcloud avatar Mar 14 '23 13:03 cpcloud

I guess there's this option too https://docs.snowflake.com/en/user-guide/python-connector-example#label-python-connector-bypass-data-conversion which means we'd have to implement the entire conversion ourselves.

cpcloud avatar Mar 14 '23 13:03 cpcloud

Actually that method is broken and just gives back data in Python format.

cpcloud avatar Mar 14 '23 13:03 cpcloud