connector-x
Implement partition_on for non numeric columns
Describe your feature request
I'm not even sure if this is something that can be implemented, but it would be amazing if partition_on could be used for non-numeric columns!
Thanks for the suggestion, @troyyyang! We currently don't have a built-in default way to partition on non-numeric columns. Please feel free to share how you think it should be implemented.
In the meantime, if you know how to partition your query, you can do the partitioning outside of connectorx and pass in a list of partitioned queries, as in this example, so you are not restricted by the column type:
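import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Each query covers one partition; together they must cover every row exactly once.
queries = [
    "SELECT * FROM lineitem WHERE l_orderkey <= 30000000",
    "SELECT * FROM lineitem WHERE l_orderkey > 30000000",
]

# Passing a list of queries makes connectorx fetch them as separate partitions.
cx.read_sql(postgres_url, queries)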
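As a rough sketch of the same idea applied to a non-numeric column, the query list can be generated from groups of string values. The helper name `build_value_queries`, the column `l_shipmode`, and the example values are assumptions for illustration, not part of connectorx. Rows whose value is NULL or is missing from every group would not be returned, so the groups need to cover all distinct values.

```python
def build_value_queries(table, column, value_groups):
    """Return one SELECT per group of string values (hypothetical helper)."""
    queries = []
    for group in value_groups:
        # Escape single quotes by doubling them, as in standard SQL string literals.
        literals = ", ".join("'" + v.replace("'", "''") + "'" for v in group)
        queries.append(f"SELECT * FROM {table} WHERE {column} IN ({literals})")
    return queries


# Assumed example values; replace them with the real distinct values of the column.
groups = [["AIR", "MAIL"], ["SHIP", "TRUCK"], ["RAIL"]]
queries = build_value_queries("lineitem", "l_shipmode", groups)

# With a reachable database, the list can be passed exactly as in the example above:
# import connectorx as cx
# df = cx.read_sql(postgres_url, queries)
```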
@wangxiaoying theoretically, partitioning by timestamps could also be added (they support min/max). This will be useful for time series data.
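Until something like that exists, timestamp partitioning can be approximated with the list-of-queries workaround. The sketch below splits a [min, max] range into equal intervals; the helper name `timestamp_range_queries`, the table `events`, and the column `created_at` are hypothetical, and in practice the bounds would come from a `SELECT MIN(...), MAX(...)` on the column.

```python
from datetime import datetime


def timestamp_range_queries(table, column, start, end, num_partitions):
    """Build one range query per equal-width time interval (hypothetical helper)."""
    step = (end - start) / num_partitions
    bounds = [start + i * step for i in range(num_partitions)] + [end]
    queries = []
    for i in range(num_partitions):
        lo, hi = bounds[i], bounds[i + 1]
        # Intervals are half-open, except the last one, which includes the maximum.
        upper_op = "<=" if i == num_partitions - 1 else "<"
        queries.append(
            f"SELECT * FROM {table} WHERE {column} >= '{lo.isoformat(sep=' ')}' "
            f"AND {column} {upper_op} '{hi.isoformat(sep=' ')}'"
        )
    return queries


# Assumed bounds for illustration only.
queries = timestamp_range_queries(
    "events", "created_at", datetime(2022, 1, 1), datetime(2022, 1, 5), 4
)
```

Equal-width intervals only balance the load when rows are spread fairly evenly over time; skewed time series may need hand-picked boundaries.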
Hi @valxv , thank you for the great suggestion. We will add this feature to our future plan : ) https://github.com/sfu-db/connector-x/issues/313
In Spark, I have gotten around this for partitioned reads on non-numeric columns by hashing the non-numeric column and using the modulus as the partition number, something like the following:
SELECT
  ABS(hashtext(non_numeric_column)) % 10 AS partition,
  non_numeric_column,
  column1,
  column2
FROM
  table
Would doing something similar work for connectorx as well?
Hi @theelderbeever, I think it should work. In this example, you can set partition_on to the partition column and partition_num to 10.
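A minimal sketch of how that could look, assuming PostgreSQL (for `hashtext`) and a table named `my_table`. The helper names are hypothetical, and the alias is changed to `partition_id` here only to stay clear of the `PARTITION` keyword; this has not been verified against a live database.

```python
def hash_partition_query(table, key_column, columns, num_buckets):
    """Build a query that adds a numeric hash-bucket column (hypothetical helper)."""
    select_list = ",\n  ".join([key_column, *columns])
    return (
        "SELECT\n"
        f"  ABS(hashtext({key_column})) % {num_buckets} AS partition_id,\n"
        f"  {select_list}\n"
        f"FROM {table}"
    )


def read_hash_partitioned(conn, table, key_column, columns, num_buckets=10):
    """Read the table with connectorx, partitioning on the hash-bucket column."""
    import connectorx as cx  # imported here so the query builder works without it

    query = hash_partition_query(table, key_column, columns, num_buckets)
    return cx.read_sql(
        conn, query, partition_on="partition_id", partition_num=num_buckets
    )


query = hash_partition_query(
    "my_table", "non_numeric_column", ["column1", "column2"], 10
)
```

Calling `read_hash_partitioned(postgres_url, "my_table", "non_numeric_column", ["column1", "column2"])` would then issue the partitioned read. The extra `partition_id` column appears in the result and can be dropped afterwards.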
When will this feature be launched?