Daft
Distributed DataFrame for Python designed for the cloud, powered by Rust
Currently agg_concat simply combines the strings without a delimiter, so the alternative would be to first collect them with agg_list and then call list.join with a delimiter, but it would be...
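
To make the difference concrete, here is a minimal plain-Python sketch of the two behaviors: concatenating a group's strings with no delimiter versus collecting them into a list and then joining with one. The helpers `agg_list` and `agg_join` below are hypothetical stand-ins written for illustration and are not Daft's API.

```python
from collections import defaultdict


def agg_list(rows, key, value):
    """Collect the `value` field of each row into a list per `key` (illustrative stand-in)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


def agg_join(rows, key, value, delimiter=""):
    """Join each group's strings; an empty delimiter mimics plain concatenation."""
    return {k: delimiter.join(v) for k, v in agg_list(rows, key, value).items()}


rows = [
    {"k": "a", "s": "x"},
    {"k": "a", "s": "y"},
    {"k": "b", "s": "z"},
]

concatenated = agg_join(rows, "k", "s")        # no delimiter
delimited = agg_join(rows, "k", "s", ", ")     # list first, then join
```

A delimiter argument on the aggregation itself would presumably collapse the two steps into one call.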
I would like to get a minhash with alternative hash algorithms, such as the first four bytes of SHA1, as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py. The deduplication rate is empirically much better...
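
As a rough illustration of the idea, the sketch below computes a MinHash signature whose base hash is the first four bytes of a SHA1 digest. It is a self-contained approximation, not the linked script or Daft's implementation: the word n-gram tokenization, the little-endian byte order, the permutation scheme, and all names and default parameters are assumptions chosen for the example.

```python
import hashlib
import random
import struct

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (assumed) affine permutations
MAX_HASH = (1 << 32) - 1        # keep permuted values in 32 bits


def sha1_hash32(data: bytes) -> int:
    """Interpret the first four bytes of the SHA1 digest as a little-endian uint32."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def word_ngrams(text: str, n: int) -> set:
    """Whitespace-tokenized word n-grams; short texts yield a single shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(text: str, num_perm: int = 16, ngram_size: int = 3, seed: int = 42) -> list:
    """For each permutation, keep the minimum permuted hash over all shingles."""
    rng = random.Random(seed)
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    hashes = [sha1_hash32(g.encode("utf-8")) for g in word_ngrams(text, ngram_size)]
    return [min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes) for a, b in perms]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Swapping `sha1_hash32` for another function with the same `bytes -> int` shape is all that changing the base hash requires in this sketch, which is the kind of pluggability the request describes.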
When PySpark saves partitioned Parquet files to a folder, it creates subfolders named partition=some_value. When I use Daft's read_parquet on the parent folder, I would like to get...
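
For reference, the `partition=some_value` folder layout can be decoded from a file path alone. The helper below is a hypothetical plain-Python sketch of that decoding, not part of Daft, and the example path is made up; values are left as strings because the path carries no type information.

```python
from pathlib import PurePosixPath


def hive_partition_values(path: str) -> dict:
    """Extract key=value folder segments from a path, ignoring the file name itself."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:  # only segments shaped like key=value
            values[key] = value
    return values


example = hive_partition_values("table/year=2024/month=01/part-0000.parquet")
```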
**Is your feature request related to a problem? Please describe.** - [ ] On the Docs homepage, add example tutorials for each of Data Engineering, Analytics and ML/AI training and...
https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/array/ops/groups.rs#L43
**Is your feature request related to a problem? Please describe.** I don't think we should use archaic python naming conventions to drive our DSL. Nearly all of our other functions...
**Is your feature request related to a problem? Please describe.** I want to flatten all columns in a struct into the top level. But it seems like I need to...
**Is your feature request related to a problem? Please describe.** For a column containing URLs, I'd like to parse them and extract relevant components. **Describe the solution you'd like** ```py...
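
As a point of comparison, Python's standard library already splits a URL into the components such a feature would likely expose. The sketch below uses `urllib.parse.urlsplit`; the `url_components` helper, its output keys, and the sample URL are assumptions for illustration, not the API proposed in the issue.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into commonly needed parts using the standard library."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = url_components("https://example.com:8080/a/b?x=1#top")
```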
### Describe the bug I am trying to do this: ``` import daft df1 = daft.from_pydict({"a": [1, 2, 3], "b": ["foo", "bar", "baz"]}) df2 = daft.from_pydict({"a": [1, 2, 3], "c":...
Implement outer joins for Swordfish. (Yes, this PR is a little big. But: 1. at least tests run in CI now, so you don't need to just take my word...