Jay Chia
Closing this in favor of #2913
cc @skrawcz as well
We can tackle append + overwrite first, and make a separate ticket for overwrite_partitions
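For context, a hypothetical sketch of what the write modes could look like on `write_parquet` — the `write_mode` parameter name and its values are assumptions about the proposed API, not a confirmed signature:

```python
import daft

df = daft.from_pydict({"a": [1, 2, 3], "part": ["x", "x", "y"]})

# Assumed API: write_mode is a proposed parameter, not a confirmed signature
df.write_parquet("out/", write_mode="append")     # add new files alongside existing data
df.write_parquet("out/", write_mode="overwrite")  # replace everything under the target path

# overwrite_partitions (deferred to a separate ticket) would replace only the
# partitions present in the incoming dataframe:
# df.write_parquet("out/", partition_cols=["part"], write_mode="overwrite_partitions")
```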
> Idea from @Fokko - support day/month/year transforms first

You can also try using the transforms that Daft has already implemented. Full list of transforms:

* [Expression.partitioning.days](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/expression_methods/daft.Expression.partitioning.days.html)
* [Expression.partitioning.hours](...
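For anyone trying this out, a minimal sketch of applying those transforms (the column names and data here are made up for illustration):

```python
import datetime
import daft

df = daft.from_pydict({
    "ts": [datetime.datetime(2024, 1, 1, 3), datetime.datetime(2024, 6, 15, 18)],
    "value": [1, 2],
})

# Derive Iceberg-style partition values from a timestamp column
df = df.with_column("ts_days", daft.col("ts").partitioning.days())
df = df.with_column("ts_hours", daft.col("ts").partitioning.hours())
df.show()
```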
I also did some very rough benchmarks before/after making the rowgroups nicer:

DuckDB:

```
Before
Code block 'Run duckdb query 1' took: 5.84075 s
Code block 'Run duckdb query 2'...
```
I tried doing `.collect().write_parquet()` on a `SCALE_FACTOR=0.2` dataset - it seems to be better, but the rowgroups are still fairly fragmented (about 4MB compressed, 10MB uncompressed), and I'm also noticing some...
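For reference, one way to check rowgroup fragmentation like this is to read the Parquet footer with PyArrow (the file path below is a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder path: point this at one of the files Daft wrote
md = pq.ParquetFile("out/part-0.parquet").metadata

for i in range(md.num_row_groups):
    rg = md.row_group(i)
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(
        f"row group {i}: {rg.num_rows} rows, "
        f"{compressed / 1e6:.1f} MB compressed, "
        f"{rg.total_byte_size / 1e6:.1f} MB uncompressed"
    )
```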
> Though that recommendation doesn't mean better performance, I think. 512MB is very large and we could do a lot more in parallel if we shrink the sizes.
>
> Currently,...
> I think I will try to hit a row count rather than a row-group size (defaulting to 512^2).

There was recently an issue in Polars that allowed very small...
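As a point of comparison, PyArrow already caps rowgroups by row count rather than byte size; a minimal sketch of the "hit a row count" approach (using PyArrow here only for illustration, not Daft's internals):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": list(range(1_000_000))})

# row_group_size caps the number of rows per rowgroup (here 512^2 = 262144),
# so rowgroup byte size follows from row width rather than being targeted directly
pq.write_table(table, "out.parquet", row_group_size=512**2)

md = pq.ParquetFile("out.parquet").metadata
print(md.num_row_groups)  # 4 rowgroups of ~262k rows for 1M rows
```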
(1) sounds like the most compelling integration point! Happy to explore what might make sense there.
I think struct and map types are fairly different:

```
dict[str, int]            # map type
{"foo": int, "bar": str}  # struct type
```

Maps can have any number of keys/values,...
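To make the distinction concrete, a small PyArrow sketch (PyArrow is just one convenient way to show it; the point isn't specific to that library):

```python
import pyarrow as pa

# Map: each row holds an arbitrary number of key/value pairs,
# but all keys share one type and all values share one type
map_arr = pa.array(
    [[("a", 1), ("b", 2)], [("c", 3)]],
    type=pa.map_(pa.string(), pa.int64()),
)

# Struct: every row has the same fixed set of named fields,
# and each field can have a different type
struct_arr = pa.array(
    [{"foo": 1, "bar": "x"}, {"foo": 2, "bar": "y"}],
    type=pa.struct([("foo", pa.int64()), ("bar", pa.string())]),
)

print(map_arr.type)     # map<string, int64>
print(struct_arr.type)  # struct<foo: int64, bar: string>
```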