Would return only the unique elements of each list, like Python's `set()` functionality.

Example:

```
df = daft.from_pydict({"a": [[1, 2, 2, 3, 3, 3], [1, 3, 5, 5]]})
df.with_column("b",...
```
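As a plain-Python sketch of the expected per-list semantics (order-preserving deduplication rather than an unordered `set()`; the list accessor itself is the feature being requested, not existing API):

```
# Order-preserving dedup per list, matching the expected behavior above.
def list_unique(values):
    return list(dict.fromkeys(values))

data = [[1, 2, 2, 3, 3, 3], [1, 3, 5, 5]]
print([list_unique(row) for row in data])
# [[1, 2, 3], [1, 3, 5]]
```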
Currently, `agg_concat` simply concatenates the strings without a delimiter, so the alternative would be to first collect the values with `agg_list` and then do `list.join` with a delimiter, but it would be...
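A sketch of that two-step workaround, assuming Daft's expression-based aggregation API accepts this shape (column names and the delimiter are illustrative):

```
import daft
from daft import col

df = daft.from_pydict({"group": ["x", "x", "y"], "text": ["a", "b", "c"]})

# Step 1: collect the strings per group into a list column.
# Step 2: join each list with an explicit delimiter.
(
    df.groupby("group")
    .agg(col("text").agg_list())
    .with_column("joined", col("text").list.join(", "))
    .show()
)
```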
I would like to get a MinHash with alternative hash algorithms, such as the first four bytes of SHA-1 as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py. The deduplication rate is empirically much better...
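A minimal sketch of the requested hash primitive: the first four bytes of a SHA-1 digest read as a 32-bit integer (the little-endian byte order is an assumption here; the linked bigcode script is the reference implementation):

```
import hashlib

def sha1_hash32(data: bytes) -> int:
    # First 4 bytes of the SHA-1 digest as an unsigned 32-bit integer
    # (little-endian byte order is an assumption).
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")

print(sha1_hash32(b"hello world"))
```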
When PySpark saves Parquet files partitioned on a column, it creates Hive-style folders of the form `partition=some_value`. When I use Daft to `read_parquet` the parent folder, I would like to get...
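A sketch of the layout in question, with the PySpark write producing the `partition=some_value` folders and the Daft read being the call under discussion (paths and column names are illustrative, and the glob pattern is an assumption):

```
from pyspark.sql import SparkSession
import daft

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "part"])

# Writes Hive-style folders: /tmp/out/part=a/..., /tmp/out/part=b/...
sdf.write.partitionBy("part").parquet("/tmp/out")

# Reading the parent folder back; the request is for the `part`
# column to be reconstructed from the folder names.
daft.read_parquet("/tmp/out/**/*.parquet").show()
```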
Would apply a function to each element in the list.

Example:

```
df = daft.from_pydict({"a": [["HeLLo WoRlD", "Hi", "WelCoMe"], ["tO", "a New WoRlD"]]})
df.with_column("b", col("a").list.apply(element().str.lower())).select("b").show()
```

Expected output:

```
╭─────────────╮...
```
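A plain-Python sketch of the expected semantics, applying the function elementwise within each list (the `list.apply`/`element()` API above is the proposal, not existing behavior):

```
data = [["HeLLo WoRlD", "Hi", "WelCoMe"], ["tO", "a New WoRlD"]]

# Expected per-row result: the same list with `str.lower` applied to each element.
print([[s.lower() for s in row] for row in data])
# [['hello world', 'hi', 'welcome'], ['to', 'a new world']]
```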
Would return the counts of each element in the lists, like the pandas `.value_counts()` or NumPy `.unique(return_counts=True)` functionality.

**Example:**

```
df = daft.from_pydict({"a": [[1, 2, 2, 3, 3, 3], [1,...
```
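A plain-Python sketch of the expected per-list semantics using `collections.Counter` (how Daft would represent the result, e.g. as a map or struct column, is left open):

```
from collections import Counter

data = [[1, 2, 2, 3, 3, 3], [1, 3, 5, 5]]

# Element -> count per list.
print([dict(Counter(row)) for row in data])
# [{1: 1, 2: 2, 3: 3}, {1: 1, 3: 1, 5: 2}]
```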
### Search before asking

- [X] I searched the [issues](https://github.com/IBM/data-prep-lab/issues) and found no similar issues.

### Component

Transforms/Other, Other

### What happened + What you expected to happen

Hi IBM...