ibis
ibis copied to clipboard
bug: Move from `psycopg2` to `psycopg` makes `Redshift` unusable with `ibis-framework`
What happened?
Hi ibis support team and thank you for this wonderful package ❤
I was looking for an alternative to R's dplyr + dbplyr for python and the ibis-framework is the best one I could find.
Issue description
I've been experimenting with the new features of ibis-framework 10.0.0.dev490 (i.e. the addition of cases() and parameter distinct of the collect() method among others) and noticed that the move from psycopg2 to psycopg (3) introduces a regression which makes ibis-framework unusable with the Redshift backend. Please note that ibis-framework v9.5.0, which uses psycopg2, works well with Redshift.
Potential solutions
From https://github.com/psycopg/psycopg/issues/122 it isn't really clear when and if official Redshift support will be added to psycopg (3). Also, looking at https://github.com/ibis-project/ibis/pull/10659 it appears that the move away from psycopg2 was not motivated by breaking changes. I do understand that there are of course other valid reasons for moving to psycopg (3).
Would it be possible to go back to psycopg2?
Alternatively, is adding official Redshift support (via package redshift-connector or equivalent) on your project roadmap?
Steps to reproduce the issue
Opening a connection to Redshift:
rs_conn = ibis.postgres.connect(
user=<RS_UID>,
password=<RS_PWD>,
host=<RS_HOST>,
port=<RS_PORT>,
database=<RS_DATABASE>,
)
results in:
psycopg.NotSupportedError: codec not available in Python: 'UNICODE'
when execution reaches the Backend._post_connect() method.
I was able to get past the connection issue by using client_encoding=UTF8 as suggested in https://github.com/psycopg/psycopg/issues/122 like this:
os.environ["PGCLIENTENCODING"] = "utf-8"
but then creating an ibis table t = rs_conn.table(name="table", database="prod.schema") using method Backend.raw_sql() hits a type incompatibility:
psycopg.errors.UndefinedColumn: column "typarray" does not exist in t
What version of ibis are you using?
10.0.0.dev490
What backend(s) are you using, if any?
Redshift (through the ibis.postgres backend)
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Thanks for the issue!
We've heard reports from ... I can't remember exactly where that redshift works with Ibis, but glad to have a concrete statement about it somewhere.
psycopg.errors.UndefinedColumn: column "typarray" does not exist in t
For this error, can you provide a more detailed traceback? It might be possible to debug that without a redshift setup (probably wishful thinking!), but I would need the traceback.
Hi @cpcloud and thank you for the prompt reply.
t = rs_conn.table(name="table", database="prod.schema")
Here is a more complete traceback:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "[REDACTED]/.venv2/lib/python3.12/site-packages/ibis/backends/sql/__init__.py", line 94, in table
table_schema = self.get_schema(name, catalog=catalog, database=database)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[REDACTED]/.venv2/lib/python3.12/site-packages/ibis/backends/postgres/__init__.py", line 514, in get_schema
with self._safe_raw_sql(type_info) as cur:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python/3.12.6/lib/python3.12/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "[REDACTED]/.venv2/lib/python3.12/site-packages/ibis/backends/postgres/__init__.py", line 700, in _safe_raw_sql
with contextlib.closing(self.raw_sql(*args, **kwargs)) as result:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[REDACTED]/.venv2/lib/python3.12/site-packages/ibis/backends/postgres/__init__.py", line 717, in raw_sql
psycopg.types.TypeInfo.fetch(con, "hstore"),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[REDACTED]/.venv2/lib/python3.12/site-packages/psycopg/_typeinfo.py", line 85, in fetch
return cls._fetch(conn, name)
^^^^^^^^^^^^^^^^^^^^^^
File "[REDACTED]/.venv2/lib/python3.12/site-packages/psycopg/_typeinfo.py", line 104, in _fetch
cur.execute(cls._get_info_query(conn), {"name": name})
File "[REDACTED]/.venv2/lib/python3.12/site-packages/psycopg/cursor.py", line 97, in execute
raise ex.with_traceback(None)
psycopg.errors.UndefinedColumn: column "typarray" does not exist in t
Let me know if any further info would be useful.
It looks like catching that exception is probably the way to go to unblock this very official redshift support.
@galen-ft I will put up a PR that you should be able test against.
Just tried PR#10805 and it does work around the immediate issue:
t = rs_conn.table("table", database="prod.schema")
# I get the warning you defined:
[REDACTED].venv2/lib/python3.12/site-packages/ibis/backends/postgres/__init__.py:730: UserWarning: Failed to load hstore extension: column "typarray" does not exist in t
warnings.warn(f"Failed to load hstore extension: {e}")
and then the following works as expected:
ibis.to_sql(t)
# SQLString('SELECT\n *\nFROM "prod"."schema"."table"')
Could you please keep this issue open a little longer so that we can do some more testing?
SQLBackend.table() is no longer backward compatible
By the way, the following worked on ibis-framework==10.0.0.dev490:
t = rs_conn.table(name="table", database="prod.schema")
but no longer does on your branch (respectively, branch main):
t = rs_conn.table(name="table", database="prod.schema")
Traceback (most recent call last):
File "<string>", line 1, in <module>
TypeError: SQLBackend.table() got some positional-only arguments passed as keyword arguments: 'name'
Seems like the first argument of method SQLBackend.table() is no longer a keyword arg (name), but is now a positional one (since commit https://github.com/ibis-project/ibis/commit/22de2b5f78b67f0f1b622387856e20e62587d6ac).
Was this change intentional? If so, it causes errors when we instantiate tables by passing the name kwarg explicitly.
Was this change intentional? If so, it causes errors when we instantiate tables by passing the name kwarg explicitly.
Yes, this was intentional as part of addressing #9125.
Could you please keep this issue open a little longer so that we can do some more testing?
Yep, no worries!
the issue happens here: https://github.com/ibis-project/ibis/blame/bbf98de4a71e4e9318919053cf2cdc8a0304a41e/ibis/backends/postgres/init.py#L725
hstore is not available on Redshift. to fix let UndefinedColumn to pass along with TypeError
I'm going to merge the PR associated with this issue, but I'll leave this issue open until you say so @galen-ft!
Thanks!
Hi @cpcloud , thank you for being patient w.r.t. this issue. We are still writing a few more tests which we'd like to run against the Redshift backend. I will let you know once we're done.
Also, happy to have the PR merged.
@galen-ft Any updates here? Would like to close this out if possible!
Hi @cpcloud , I do appreciate your patience. We're in a busy period, but in the meantime we've used ibis for several complex queries and we're at the end of running extensive testing. I do have a few things to report back and will do so at the first opportunity, a bit later this week. Thanks!
Hi @cpcloud and thank you again for the patience.
Below is what we found during testing.
ARRAY_AGG does not exist in Redshift
- method collect() transpiles to ARRAY_AGG for BigQuery and Postgres, however this function does not exist in Redshift. The Redshift equivalent is LISTAGG.
For example:
new_col =_.value.collect(distinct=True)
should become something along the lines of:
-- LISTAGG ignores NULL values and empty strings
LISTAGG(DISTINCT "t7"."value", ',') AS "new_col"
instead of:
ARRAY_AGG(DISTINCT "t7"."value") FILTER(WHERE "t7"."value" IS NOT NULL) AS "new_col"
Window function LAG uses the entire partition even though a frame is specified
For example, the following:
# ...
mutate(
new_col=_.some_col.lag().over(
ibis.window(
group_by=_.some_id,
order_by=[ibis.asc(_.time_stamp),ibis.asc(_.some_other_id)]
)
)
) \
# ...
is converted to
LAG("t9"."col1") OVER (PARTITION BY "t9"."col2" ORDER BY "t9"."col3" ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS new_col
in Redshift and Postgres.
This works correctly when the backend is BigQuery:
LAG(`t9`.`col1`) OVER (PARTITION BY `t9`.`col2` ORDER BY `t9`.`col3` ASC) AS `new_col`
Ibis does not close a Postgres/Redshift open connection
import ibis
rs_conn = ibis.postgres.connect(
user=user,
password=password,
host=host,
port=port,
database=database,
client_encoding="utf-8"
)
> ResourceWarning: connection <psycopg.Connection [IDLE] (host=<HOST> port=<PORT> user=<USER> database=<DB>) at 0x7f3e8c4c2060> was deleted while still open. Please use 'with' or '.close()' to close the connection
it looks like the backend doesn't provide a close() method.
rs_conn.close()
> AttributeError: 'Backend' object has no attribute 'close'
Thanks for the detailed reply. Responses inline.
ARRAY_AGGdoes not exist in Redshift
There's not much we can do here except possibly expose a dialect argument to the postgres backend. Realistically this is probably too much of a hack, since redshift has diverged significantly from postgres at this point, and we should just have another backend.
Is your company/organization willing to sponsor a redshift instance for running CI? That would be a strong motivator to add a new Redshift backend.
Window function LAG uses the entire partition even though a frame is specified
I'm not sure what you mean here. Did you mean this?
Window function LAG uses the entire partition even though a frame is not specified
(the emphasis is mine)
... in which case this might be a bug or an inconsistency across backends.
Ibis does not close a Postgres/Redshift open connection
The Ibis method is called disconnect, not `close.
Hi!
ARRAY_AGG does not exist in Redshift
There's not much we can do here except possibly expose a dialect argument to the postgres backend. Realistically this is probably too much of a hack, since redshift has diverged significantly from postgres at this point, and we should just have another backend.
Is your company/organization willing to sponsor a redshift instance for running CI? That would be a strong motivator to add a new Redshift backend.
I'll check to see if that can be done.
In the meantime, I think there may be an alternative. As I mentioned above, we are evaluating the ibis-framework as an alternative to R's dplyr + dbplyr.
That said, Hadley Wickham's dbplyr R package has defined a way to simulate a Redshift backend. It's built on top of Postgres and handles the difference in backend-redshift.R. Quoting their documentation, "there are generally few differences, apart from string manipulation". In particular, they use LISTAGG for flattening and provide overrides for LAG and LEAD among other functions.
Window function LAG uses the entire partition even though a frame is specified
I'm not sure what you mean here. Did you mean this?
Window function LAG uses the entire partition even though a frame is not specified
(the emphasis is mine)
... in which case this might be a bug or an inconsistency across backends.
Exactly, for the Postgres backend, ibis seems to add the frame clause even though it wasn't explicitly requested. This is not the case for BigQuery.
Here's the exact error:
psycopg.errors.SyntaxError: Frame clause should not be specified for window function lag
Hi @cpcloud , I've sent you a message on LinkedIn regarding your request so that we can discuss it further.
Is your company/organization willing to sponsor a redshift instance for running CI? That would be a strong motivator to add a new Redshift backend.