
This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas?

Open nithinreddyy opened this issue 2 years ago • 3 comments

Hello Team,

Thanks for creating this amazing library. Is there any way to use the library for PySpark and SQL instead of Pandas?

nithinreddyy avatar Jan 17 '23 07:01 nithinreddyy

Hey!

Right now this relies on being able to run aggregations quickly to summarize the data to add to the prompt, so it only really works if the data is in memory.

For things like Dask, PySpark, Modin, etc. (remote data, likely 'big' data), this would require updating the aggregation code. This is theoretically possible: the datasketch aggregations this is intended to work off of are O(N), parallelizable, and mergeable. That said, it isn't supported right now.
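To illustrate what "O(N), parallelizable and mergeable" means in practice, here is a minimal sketch of a K-minimum-values distinct-count sketch. It is not this library's aggregation code; the `KMVSketch` class, its methods, and the sample data are all hypothetical and only meant to show the merge property that a distributed backend would rely on.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # the k smallest normalized hashes seen so far

    @staticmethod
    def _hash(item):
        # Map an item to a deterministic float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        # One pass over the data: each item is hashed once.
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        # Merging keeps the k smallest hashes from both sketches, so
        # partitions can be summarized independently and combined later.
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        # With fewer than k hashes the count is exact (barring collisions).
        if len(self.values) < self.k:
            return float(len(self.values))
        return (self.k - 1) / max(self.values)


# Two overlapping partitions, as two workers might see them.
partition_a = [f"user-{i % 3000}" for i in range(4000)]
partition_b = [f"user-{2000 + i % 3000}" for i in range(4000)]

sketch_a, sketch_b, full = KMVSketch(), KMVSketch(), KMVSketch()
for row in partition_a:
    sketch_a.add(row)
    full.add(row)
for row in partition_b:
    sketch_b.add(row)
    full.add(row)

merged = sketch_a.merge(sketch_b)
print(merged.estimate(), full.estimate())
```

The merged sketch holds the same hashes as a sketch built in a single pass over all rows, which is why this kind of summary could in principle be computed per partition in Spark or Dask and then combined.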

For systems like "SQL" (e.g. remote databases: Snowflake, ClickHouse, Postgres, SQLite), this cannot be used directly right now without downloading the table first (e.g. with pd.read_sql).
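A minimal sketch of the "download the table first" workaround, assuming an in-memory SQLite database as a stand-in for the remote one. The `orders` table and its rows are made up for illustration; with a real database you would pass your own connection and query.

```python
import sqlite3

import pandas as pd

# Stand-in database; a real setup would connect to Postgres, Snowflake, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (3, 3.25)],
)
conn.commit()

# Pull the table into memory as a pandas DataFrame.
df = pd.read_sql("SELECT * FROM orders", conn)
conn.close()

print(df)
```

Once the data is in a DataFrame, the library can be used on it as with any other in-memory pandas data. For large tables, adding a `LIMIT` or a filter to the query keeps the download manageable.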

bluecoconut avatar Jan 18 '23 03:01 bluecoconut

Are these sketches something we could add to a remote SQL DB as a UDF and index for faster usage?

andrewluetgers avatar Jan 23 '23 21:01 andrewluetgers

Thinking specifically of BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/sketches

andrewluetgers avatar Jan 23 '23 22:01 andrewluetgers