aws-sdk-pandas
Postgres JSONB field is (mostly) unsupported
Describe the bug
JSONB is an optimized version of JSON inside Postgres. From the point of view of retrieval, it should behave the same way as a normal JSON field.
While trivial fields are retrieved normally, fields with heterogeneous data (e.g. a list containing both a number and a string, or a mapping with values of mixed types; see the example below) can't be read using wrangler, while other tools handle them without issue.
I did not check whether wrangler handles the plain JSON field type any better, because I don't use it.
How to Reproduce
Create a table with a JSONB column and insert rows whose values have heterogeneous types:
create table test_table
(
test_table_id serial primary key,
jsonb_column jsonb
);
insert into test_table (test_table_id, jsonb_column)
values (1, '{"a": 1, "b": 2}'),
(2, '{"a": 1, "b": "hey", "c": null, "d": []}');
Attempting to retrieve this table with wr.postgresql.read_sql_query or read_sql_table results in an exception like the following (or a similar one reporting a failed type conversion):
ArrowInvalid: Could not convert 'hey' with type str: tried to convert to int64
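For reference, a minimal sketch of such a read; the Glue Catalog connection name passed to wr.postgresql.connect is an assumption and needs to be adapted to your setup:

import awswrangler as wr

# Assumed Glue Catalog connection name; replace with your own.
con = wr.postgresql.connect("my-postgres-connection")
try:
    # Either call fails with the ArrowInvalid error shown above.
    df = wr.postgresql.read_sql_query("SELECT * FROM test_table", con=con)
    # df = wr.postgresql.read_sql_table(table="test_table", schema="public", con=con)
finally:
    con.close()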
Expected behavior
table data is retrieved without error.
Your project
No response
Screenshots
No response
OS
Mac 10.15
Python version
3.10
AWS SDK for pandas version
2.16.1
Additional context
No response
AFAIK, this is not possible at the moment. In AWS SDK for pandas, data type conversion on read for PostgreSQL (and other databases) follows this approach: Python type -> inferred Arrow type -> pandas type. Here is where the first conversion, from Python type to Arrow type, is done. At the moment, pyarrow does not handle mixed types: an exception is thrown instead of attempting a coercion (to string, for example). Here is a way to reproduce the above:
import pandas as pd
import pyarrow as pa

# A column mixing an int and a str: pyarrow fails to infer a single Arrow type
# instead of coercing the values to string.
df = pd.DataFrame({"a": [1, "foo"]})
table = pa.Table.from_pandas(df)
ArrowInvalid: ("Could not convert 'str' with type str: tried to convert to int64", 'Conversion failed for column a with type object')
I see where the problem comes from.
It should not be expected that pyarrow will ever change this behavior (they do not have a tuple-like primitive, and there is no reason to introduce one).
Since changing pyarrow is very unlikely, the right solution would be to add a parameter that enforces keeping selected JSON fields as strings/bytes instead of parsing them into Python structures.
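A minimal sketch of the kind of coercion such a parameter could apply, assuming the JSONB values have already been decoded into Python objects (the column name and data below are illustrative):

import json

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"jsonb_column": [{"a": 1, "b": 2}, {"a": 1, "b": "hey", "c": None, "d": []}]}
)

# Serialize each decoded value back to a JSON string so the column has a
# uniform Python type; pyarrow then infers a plain string column.
df["jsonb_column"] = df["jsonb_column"].apply(json.dumps)

table = pa.Table.from_pandas(df)  # no longer raises ArrowInvalid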
@malachi-constant why is the issue closed?
JSON is not a rare field type, and no solution or workaround was suggested.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
This is still an issue that would be nice to fix!
In the meantime, one option to work around this issue is to cast the column in your SQL so that the result set contains strings instead of relying on jsonb -> object decoding. For example, you can use SELECT jsonb_col::TEXT FROM tbl.
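Applied to the reproduction above, the workaround could look roughly like this (the Glue connection name is again an assumption); json.loads can then be used on the pandas side if you need the decoded values:

import awswrangler as wr

con = wr.postgresql.connect("my-postgres-connection")  # assumed connection name
try:
    # Casting JSONB to TEXT keeps each value as a single string, so pyarrow
    # infers a plain string column instead of failing on mixed element types.
    df = wr.postgresql.read_sql_query(
        "SELECT test_table_id, jsonb_column::TEXT AS jsonb_column FROM test_table",
        con=con,
    )
finally:
    con.close()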