connector-x
connector-x copied to clipboard
Specifying both `LIMIT` and `ORDER BY` leads to incorrect/no results when using `partition_on` due to incorrect partition range query
What language are you using?
Python
What version are you using?
0.3.1
What database are you using?
PostgreSQL
What dataframe are you using?
Pandas
Can you describe your bug?
When using a query with a LIMIT
and ORDER BY
in a query along with the connector-x partition_on
parameter, the ORDER BY
clause is not included in the CXTMPTAB_RANGE
query that determines the bounds for the partitioning, but the LIMIT
is included. This means that the bounds it returns are not likely to match the subsequent queries that apply both the LIMIT
and ORDER BY
. As such, running the query via connector-x can unexpectedly result in no rows being returned and empty data in the resultant pandas dataframe.
Instead, the CXTMPTAB_RANGE
query for partition_on
should include the LIMIT
, OFFSET
, and ORDER BY
in order to ensure the bounds are consistent. Otherwise the bounds are "arbitrary" based on whatever order the database happened to have used during the range query.
What are the steps to reproduce the behavior?
Example query / code
cx.read_sql(
conn=postgres_url,
query='SELECT * FROM lineitem ORDER BY l_orderkey LIMIT 100',
return_type='pandas',
partition_on='l_orderkey',
partition_num=1,
)
Queries that will be issued by connector-x:
🐛 Query for partition bounds—this is the problematic query missing ORDER BY
:
SELECT min(CXTMPTAB_RANGE.l_orderkey), max(CXTMPTAB_RANGE.l_orderkey) FROM (SELECT * FROM lineitem LIMIT 100) AS CXTMPTAB_RANGE
Count query (also missing ORDER BY
but seemingly not consequential):
SELECT count(*) FROM (SELECT * FROM lineitem LIMIT 100) AS CXTMPTAB_COUNT
Query for data (includes limit and order-by, but wrong bounds from range query above):
COPY (SELECT * FROM (SELECT * FROM lineitem ORDER BY lineitem.l_orderkey LIMIT 100) AS CXTMPTAB_PART WHERE xxxxx <= CXTMPTAB_PART.l_orderkey AND CXTMPTAB_PART.l_orderkey < yyyyy) TO STDOUT WITH BINARY
What is the error?
There is no error, however, a dataframe with blank values is returned since the connector-x query that fetches data ends up using the wrong bounds and so can get no results like this:
It could also be that connector-x will return the wrong results (rather than no/blank results), depending on whatever incorrect bounds end up getting used.