aws-sdk-pandas
Postgres JSONB field is (mostly) unsupported
Describe the bug
JSONB is an optimized version of JSON inside Postgres. From the point of view of retrieval, it should behave the same way as a normal JSON field.
While trivial fields are retrieved normally, fields with heterogeneous data (e.g. a list containing both a number and a string, or a mapping with values of mixed types; see the example below) can't be read using wrangler, while other tools handle them without issue.
I did not check whether wrangler handles the plain JSON field type any better, because I don't use it.
How to Reproduce
Create a table with a JSONB column and insert rows whose values have heterogeneous types:
create table test_table
(
test_table_id serial primary key,
jsonb_column jsonb
);
insert into test_table (test_table_id, jsonb_column)
values (1, '{"a": 1, "b": 2}'),
(2, '{"a": 1, "b": "hey", "c": null, "d": []}');
Attempting to retrieve this table with wr.postgresql.read_sql_query or read_sql_table results in an exception like the following (or a similar one reporting a failed type conversion):
ArrowInvalid: Could not convert 'hey' with type str: tried to convert to int64
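For reference, a minimal sketch of such a read; the Glue Catalog connection name passed to wr.postgresql.connect is an assumption and needs to be adapted to your setup:

import awswrangler as wr

# Assumed Glue Catalog connection name; replace with your own.
con = wr.postgresql.connect("my-postgres-connection")
try:
    # Either call fails with the ArrowInvalid error shown above.
    df = wr.postgresql.read_sql_query("SELECT * FROM test_table", con=con)
    # df = wr.postgresql.read_sql_table(table="test_table", schema="public", con=con)
finally:
    con.close()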
Expected behavior
table data is retrieved without error.
Your project
No response
Screenshots
No response
OS
Mac 10.15
Python version
3.10
AWS SDK for pandas version
2.16.1
Additional context
No response
AFAIK, this is not possible at the moment. In AWS SDK for pandas, data type conversion on read for PostgreSQL (and other databases) follows this approach: Python type -> inferred Arrow type -> pandas type. Here is where the first conversion, from Python type to Arrow type, is done. At the moment, pyarrow does not handle mixed types: an exception is thrown instead of attempting a coercion (to string, for example). Here is a way to reproduce the above:
import pandas as pd
import pyarrow as pa

# A column mixing an int and a str: pyarrow fails to infer a single Arrow type
# instead of coercing the values to string.
df = pd.DataFrame({"a": [1, "foo"]})
table = pa.Table.from_pandas(df)
ArrowInvalid: ("Could not convert 'str' with type str: tried to convert to int64", 'Conversion failed for column a with type object')
I see where the problem comes from.
It should not be expected that pyarrow will ever change this behavior (they do not have a tuple-like primitive, and there is no reason to introduce one).
Since changing pyarrow is very unlikely, the right solution would be to add a parameter that enforces keeping selected JSON fields as strings/bytes instead of parsing them into Python structures.
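A minimal sketch of the kind of coercion such a parameter could apply, assuming the JSONB values have already been decoded into Python objects (the column name and data below are illustrative):

import json

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"jsonb_column": [{"a": 1, "b": 2}, {"a": 1, "b": "hey", "c": None, "d": []}]}
)

# Serialize each decoded value back to a JSON string so the column has a
# uniform Python type; pyarrow then infers a plain string column.
df["jsonb_column"] = df["jsonb_column"].apply(json.dumps)

table = pa.Table.from_pandas(df)  # no longer raises ArrowInvalid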
@malachi-constant why is the issue closed?
JSON is not a rare field type, and no solution or workaround was suggested.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
This is still an issue that would be nice to fix!
In the meantime, one option to work around this issue is to cast the column in your SQL so that the result set contains strings instead of relying on jsonb -> object decoding. For example, you can use SELECT jsonb_col::TEXT FROM tbl.
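Applied to the reproduction above, the workaround could look roughly like this (the Glue connection name is again an assumption); json.loads can then be used on the pandas side if you need the decoded values:

import awswrangler as wr

con = wr.postgresql.connect("my-postgres-connection")  # assumed connection name
try:
    # Casting JSONB to TEXT keeps each value as a single string, so pyarrow
    # infers a plain string column instead of failing on mixed element types.
    df = wr.postgresql.read_sql_query(
        "SELECT test_table_id, jsonb_column::TEXT AS jsonb_column FROM test_table",
        con=con,
    )
finally:
    con.close()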