This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas?
Hello Team,
Thanks for creating this amazing library. Is there any way to use the library for PySpark and SQL instead of Pandas?
Hey!
Right now this relies on being able to run aggregations quickly to summarize the data that gets added to the prompt, so it only really works if the data is in memory.
For things like Dask, PySpark, Modin, etc. (remote data, likely "big" data): this would require updating the aggregation code. That's theoretically possible (the data-sketch aggregations this is intended to work off of are O(N), parallelizable, and mergeable), but it isn't supported right now.
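For concreteness, here's a rough, hypothetical illustration of the merge idea using the Apache `datasketches` package (not something this library does today): each partition builds its own sketch, and the partials are merged into one global summary. On Spark this would roughly correspond to `mapPartitions` plus a merge on the driver.

```python
# Hypothetical illustration (not part of this library): sketch-based
# aggregations are mergeable, so each partition can build its own sketch
# and the partials can be combined into a single global summary.
from datasketches import kll_floats_sketch  # pip install datasketches

# Stand-ins for three partitions of a larger / remote dataset.
partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]

# Build one quantile sketch per partition (this part is embarrassingly parallel).
partials = []
for part in partitions:
    sk = kll_floats_sketch(200)  # k=200 trades accuracy for sketch size
    for value in part:
        sk.update(float(value))
    partials.append(sk)

# Merge the partial sketches into one global sketch.
merged = kll_floats_sketch(200)
for sk in partials:
    merged.merge(sk)

print(merged.get_quantile(0.5))  # approximate median over all partitions
```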
For "SQL" systems (e.g. remote databases: Snowflake, ClickHouse, Postgres, SQLite), this can't be used directly right now without downloading the table first (e.g. with `pd.read_sql`).
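In the meantime, the workaround is to pull the table (or a capped sample of it) into pandas and run the library on the resulting DataFrame. A minimal sketch, with a hypothetical connection string and table name:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string / table -- swap in your own database.
engine = create_engine("postgresql://user:pass@host:5432/mydb")
df = pd.read_sql("SELECT * FROM my_table LIMIT 100000", engine)  # cap the download

# `df` is now an in-memory DataFrame, so it can be used with the library as usual.
```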
Are these sketches something we could add to a remote SQL DB as a UDF and index for faster usage?
Thinking specifically of BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/sketches
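For context, this is roughly what the linked BigQuery sketch functions look like when queried from Python (`HLL_COUNT.INIT` / `HLL_COUNT.MERGE` are BigQuery standard SQL; the project/dataset/table names below are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes standard GCP auth is already set up

# HLL_COUNT.INIT builds a per-group sketch; HLL_COUNT.MERGE combines sketches
# and extracts the approximate distinct count.
sql = """
SELECT HLL_COUNT.MERGE(sketch) AS approx_distinct_users
FROM (
  SELECT HLL_COUNT.INIT(user_id) AS sketch
  FROM `my_project.my_dataset.events`
  GROUP BY DATE(event_ts)
)
"""
df = client.query(sql).to_dataframe()
```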