Daft
Distributed DataFrame for Python designed for the cloud, powered by Rust
Closes #1768. This is a POC for adding overwrite / overwrite-partitions mode for our write methods. The idea is to collect all the file paths that were written across...
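A minimal sketch of how an overwrite mode built on collected file paths could work, using only the Python standard library. The cleanup step (deleting every pre-existing file that is not in the collected set) is an assumption about the truncated description, and `overwrite_directory` and the file names are hypothetical, not Daft APIs.

```python
import tempfile
from pathlib import Path


def overwrite_directory(root, new_files):
    """Write new_files (name -> contents) under root, then drop everything else."""
    written = []
    for name, contents in new_files.items():
        path = root / name
        path.write_text(contents)
        written.append(path)  # collect every path produced by this write

    # Assumed cleanup step: any file not in the collected set is stale.
    keep = set(written)
    for existing in root.iterdir():
        if existing.is_file() and existing not in keep:
            existing.unlink()
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "old-0.parquet").write_text("stale data")  # left over from an earlier write
    overwrite_directory(root, {"new-0.parquet": "fresh", "new-1.parquet": "fresh"})
    remaining = sorted(p.name for p in root.iterdir())
    print(remaining)
```

Deleting only after all new files are written means a failure mid-write leaves the old data in place rather than an empty directory.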
Write a guide to enumerate key concepts around partitioning: ``` Increasing the number of partitions in your DataFrame has the following effects: 1. Increase the amount of parallelism available to...
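To illustrate the first effect listed above, here is a small standard-library sketch rather than Daft code. It assumes each partition is an independent unit of work, so the partition count caps how many tasks can run at once; `split_into_partitions` and `process_partition` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Round-robin rows into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    # Stand-in for whatever work runs on one partition.
    return sum(partition)


rows = list(range(12))
results = {}
for n in (2, 4):
    partitions = split_into_partitions(rows, n)
    # One worker per partition: n partitions allow up to n concurrent tasks.
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial_sums = list(pool.map(process_partition, partitions))
    largest = max(len(p) for p in partitions)
    results[n] = (len(partitions), largest, sum(partial_sums))
    print(f"{n} partitions, largest holds {largest} rows, total = {sum(partial_sums)}")
```

Doubling the partition count halves the rows each task handles while the combined result stays the same.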
**Describe the bug** If a task crashes during a write in append mode, it restarts and writes all of its files again, leaving the files from the failed attempt behind as dirty files. **To Reproduce** Steps to...
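The failure mode can be simulated without Daft. The sketch below assumes each output file gets a fresh unique name, so a retried task cannot overwrite what the failed attempt already wrote; `write_task` and its `fail_after` parameter are hypothetical.

```python
import tempfile
import uuid
from pathlib import Path


def write_task(out_dir, chunks, fail_after=None):
    """Append one uniquely named file per chunk; optionally crash part-way."""
    for i, chunk in enumerate(chunks):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated task crash")
        (out_dir / f"{uuid.uuid4()}.txt").write_text(chunk)


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    chunks = ["a", "b", "c"]
    try:
        # First attempt writes two files, then crashes.
        write_task(out_dir, chunks, fail_after=2)
    except RuntimeError:
        # The retry starts from scratch and writes all three files again.
        write_task(out_dir, chunks)
    files_on_disk = len(list(out_dir.iterdir()))
    print(files_on_disk)
```

Three chunks end up as five files: the two from the failed attempt are the dirty files.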
**Is your feature request related to a problem? Please describe.** When users run `df.count()`, they often expect `df.count_rows()` behavior. Instead, `df.count()` will perform a count aggregation on every column, which...
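The difference between the two behaviors can be shown on a plain dictionary of columns. This is an illustration, not Daft's implementation; it assumes the count aggregation skips nulls, as SQL-style `COUNT(column)` does, and both helper functions are hypothetical.

```python
def count_per_column(table):
    """Count aggregation on every column: number of non-null values in each."""
    return {name: sum(v is not None for v in values) for name, values in table.items()}


def count_rows(table):
    """Row count: a single number, independent of nulls."""
    lengths = {len(values) for values in table.values()}
    assert len(lengths) == 1, "all columns must have the same length"
    return lengths.pop()


table = {
    "a": [1, None, 3],
    "b": ["x", "y", None],
    "c": [None, None, None],
}

per_column = count_per_column(table)
total_rows = count_rows(table)
print(per_column)
print(total_rows)
```

Under that assumption the per-column result is one value per column, and none of those values need equal the row count.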
User-defined global expressions, similar to typical UDFs, are Python functions that users can use as expressions. Unlike typical UDFs, however, global expressions produce a value with...
Additional expressions:
- [ ] concat
- [ ] collect_list
- [ ] collect_set - same as collect_list but no duplicates
- [ ] distinct - Special in that it’s...
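The intended difference between `collect_list` and `collect_set` can be sketched in plain Python. The functions below are illustrative stand-ins, not Daft's implementation, and preserving first-seen order in the set variant is a choice made for this sketch.

```python
from collections import defaultdict


def collect_list(keys, values):
    """Gather every value per key, duplicates included."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


def collect_set(keys, values):
    """Same as collect_list, but each value appears at most once per key."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)


keys = ["x", "x", "y", "x"]
values = [1, 1, 2, 3]

as_lists = collect_list(keys, values)
as_sets = collect_set(keys, values)
print(as_lists)
print(as_sets)
```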
`DataFrame.groupby` should correctly accept list expressions. Expected behavior: ```python >>> df = daft.from_pydict({ ... "strings": ["a", "b", "c", "d"], ... "lists": [[1, 1, 1, 1], [1, 1, 1, 1], [2,...
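Grouping on list-valued keys can be sketched in plain Python. The data here is made up for illustration and is not the truncated example above; `group_by_list_key` is a hypothetical helper, and converting each list to a tuple is just one way to get a hashable key.

```python
from collections import defaultdict


def group_by_list_key(keys, values):
    """Group values by list-valued keys; equal lists land in the same group."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        # Lists are unhashable, so use an equivalent tuple as the dictionary key.
        groups[tuple(key)].append(value)
    return dict(groups)


list_keys = [[1, 1], [1, 1], [2, 2], [2, 3]]
labels = ["w", "x", "y", "z"]

grouped = group_by_list_key(list_keys, labels)
print(grouped)
```

Two keys compare equal only when every element matches in order, so `[2, 2]` and `[2, 3]` form separate groups.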
Hey - so this might not be on the roadmap for Daft at all, but I thought it was worth asking about! Also, just to say, thanks for building this...
Allow for regex in expressions. For example, to select all expressions that start with `c`:
```
df.select(col("c*"))
```
Flatten a struct `c`:
```
df.select(col("c.*"))
```
See: https://github.com/Eventual-Inc/Daft/discussions/1964
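A rough sketch of the two requested behaviors, using only the standard library. It assumes `c*` is meant as a glob-style wildcard (as a strict regular expression, `c*` would mean "zero or more `c`"), and that flattened fields are named `struct.field`; both helpers are hypothetical, not Daft APIs.

```python
import fnmatch


def select_columns(column_names, pattern):
    """Return the column names matching a glob-style pattern, in original order."""
    return [name for name in column_names if fnmatch.fnmatchcase(name, pattern)]


def flatten_struct(row, struct_name):
    """Expand one struct (nested dict) into top-level 'struct.field' entries."""
    return {f"{struct_name}.{field}": value for field, value in row[struct_name].items()}


columns = ["a", "c", "c1", "count", "b.c"]
starts_with_c = select_columns(columns, "c*")

row = {"a": 1, "c": {"x": 10, "y": 20}}
flattened = flatten_struct(row, "c")

print(starts_with_c)
print(flattened)
```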
In an effort to mitigate a max protobuf size (> 2 GB) error in Ray, we currently pass reduce task inputs as a list of object refs and `ray.get()` them...