Jay Chia

Results 126 comments of Jay Chia

Closing this in favor of #2913

We can tackle append + overwrite first, and make a separate ticket for overwrite_partitions

> Idea from @Fokko - support day/month/year transforms first

You can also try using the transforms that Daft has already implemented. Full list of transforms:

* [Expression.partitioning.days](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/expression_methods/daft.Expression.partitioning.days.html)
* [Expression.partitioning.hours](...
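
For readers unfamiliar with day/month/year partition transforms, here is a rough sketch of what they compute. It assumes Iceberg-style semantics (whole units elapsed since 1970-01-01) and uses only the standard library. The function names are hypothetical and this is not Daft's implementation.

```python
from datetime import date

# Assumption: Iceberg-style transforms count whole units since the Unix epoch.
EPOCH = date(1970, 1, 1)


def years_transform(d: date) -> int:
    """Whole years elapsed since 1970."""
    return d.year - EPOCH.year


def months_transform(d: date) -> int:
    """Whole months elapsed since 1970-01."""
    return (d.year - EPOCH.year) * 12 + (d.month - 1)


def days_transform(d: date) -> int:
    """Whole days elapsed since 1970-01-01."""
    return (d - EPOCH).days


if __name__ == "__main__":
    sample = date(1971, 2, 1)
    # Rows with equal transform values would land in the same partition.
    print(years_transform(sample), months_transform(sample), days_transform(sample))
```

Coarser transforms produce fewer, larger partitions, which is one reason they are a natural first step before hour-level or bucket transforms.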

I also did some very rough benchmarks before/after making the rowgroups nicer:

DuckDB:

```
Before
Code block 'Run duckdb query 1' took: 5.84075 s
Code block 'Run duckdb query 2'...
```

I tried doing `.collect().write_parquet()` on a `SCALE_FACTOR=0.2` dataset - it seems to be better but the rowgroups are still fairly fragmented (about 4MB compressed, 10MB uncompressed) but also noticing some...

> Though that recommendation doesn't mean better performance I think. 512MB is very large and we could do a lot more in parallel if we shrink the sizes.
>
> Currently,...

> I think I will try to hit a row count rather than a row-group size (defaulting to 512^2). Currently there was an issue in Polars that allowed very small...
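
To make the row-count idea concrete, here is a small sketch of splitting a table into row groups by a target row count instead of a byte size. The default of 512^2 rows comes from the quote above. The helper name and its signature are hypothetical, and this is not how Polars implements it.

```python
# 512^2 = 262,144 rows per row group, as mentioned in the quoted comment.
DEFAULT_ROWS_PER_GROUP = 512**2


def row_group_bounds(total_rows: int, rows_per_group: int = DEFAULT_ROWS_PER_GROUP):
    """Return (start, end) row offsets for each row group; end is exclusive."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    return [
        (start, min(start + rows_per_group, total_rows))
        for start in range(0, total_rows, rows_per_group)
    ]


if __name__ == "__main__":
    # 600,000 rows become two full row groups plus one smaller trailing group.
    for start, end in row_group_bounds(600_000):
        print(start, end, end - start)
```

Targeting a row count keeps row groups uniform regardless of column width, at the cost of byte sizes that vary with the schema.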

(1) sounds like the most compelling integration point! Happy to explore integrations there that might make sense.

I think struct and map types are fairly different

```
dict[str, int] # map type
{"foo": int, "bar": str} # struct type
```

Maps can have any number of keys/values,...
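
The distinction can be sketched in plain Python. This is an illustration under stated assumptions, not Daft's type system: `Record`, `is_valid_map`, and `is_valid_struct` are hypothetical names, with `TypedDict` standing in for a struct type and `dict[str, int]` for a map type.

```python
from typing import TypedDict, get_type_hints

# Map type: any number of entries, but every key is a str and every value an int.
WordCounts = dict[str, int]


# Struct type: a fixed set of named fields, each with its own type.
class Record(TypedDict):
    foo: int
    bar: str


def is_valid_map(value: dict) -> bool:
    """True if all keys are str and all values are int, for any number of entries."""
    return all(isinstance(k, str) and isinstance(v, int) for k, v in value.items())


def is_valid_struct(value: dict) -> bool:
    """True if value has exactly the fields of Record, each with the declared type."""
    hints = get_type_hints(Record)
    return value.keys() == hints.keys() and all(
        isinstance(value[name], tp) for name, tp in hints.items()
    )


if __name__ == "__main__":
    print(is_valid_map({"a": 1, "b": 2, "c": 3}))   # any number of keys is fine
    print(is_valid_struct({"foo": 1, "bar": "x"}))  # exactly the declared fields
    print(is_valid_struct({"foo": 1}))              # missing field is rejected
```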