connector-x icon indicating copy to clipboard operation
connector-x copied to clipboard

Specifying both `LIMIT` and `ORDER BY` leads to incorrect/no results when using `partition_on` due to incorrect partition range query

Open sjdemartini opened this issue 1 year ago • 1 comments

What language are you using?

Python

What version are you using?

0.3.1

What database are you using?

PostgreSQL

What dataframe are you using?

Pandas

Can you describe your bug?

When using a query with a LIMIT and ORDER BY in a query along with the connector-x partition_on parameter, the ORDER BY clause is not included in the CXTMPTAB_RANGE query that determines the bounds for the partitioning, but the LIMIT is included. This means that the bounds it returns are not likely to match the subsequent queries that apply both the LIMIT and ORDER BY. As such, running the query via connector-x can unexpectedly result in no rows being returned and empty data in the resultant pandas dataframe.

Instead, the CXTMPTAB_RANGE query for partition_on should include the LIMIT, OFFSET, and ORDER BY in order to ensure the bounds are consistent. Otherwise the bounds are "arbitrary" based on whatever order the database happened to have used during the range query.

What are the steps to reproduce the behavior?

Example query / code
cx.read_sql(
    conn=postgres_url,
    query='SELECT * FROM lineitem ORDER BY l_orderkey LIMIT 100',
    return_type='pandas',
    partition_on='l_orderkey',
    partition_num=1,
)
Queries that will be issued by connector-x:

🐛 Query for partition bounds—this is the problematic query missing ORDER BY :

SELECT min(CXTMPTAB_RANGE.l_orderkey), max(CXTMPTAB_RANGE.l_orderkey) FROM (SELECT * FROM lineitem LIMIT 100) AS CXTMPTAB_RANGE

Count query (also missing ORDER BY but seemingly not consequential):

SELECT count(*) FROM (SELECT * FROM lineitem LIMIT 100) AS CXTMPTAB_COUNT

Query for data (includes limit and order-by, but wrong bounds from range query above):

COPY (SELECT * FROM (SELECT * FROM lineitem ORDER BY lineitem.l_orderkey LIMIT 100) AS CXTMPTAB_PART WHERE xxxxx <= CXTMPTAB_PART.l_orderkey AND CXTMPTAB_PART.l_orderkey < yyyyy) TO STDOUT WITH BINARY

What is the error?

There is no error, however, a dataframe with blank values is returned since the connector-x query that fetches data ends up using the wrong bounds and so can get no results like this:

image

It could also be that connector-x will return the wrong results (rather than no/blank results), depending on whatever incorrect bounds end up getting used.

sjdemartini avatar Apr 11 '23 18:04 sjdemartini