
how to flatten/unnest a struct?

Open universalmind303 opened this issue 4 months ago • 1 comment

Is your feature request related to a problem? Please describe.

I want to flatten all fields of a struct column into top-level columns, but it seems I need to manually select every key to do that.

Describe the solution you'd like


from urllib.parse import urlparse

import daft
from daft import col

urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00005-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00006-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00014-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00015-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00016-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00017-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00018-of-00020.parquet"
]

# Parse each URL once and collect its components into a dict.
parsed_urls = [{
    'scheme': parsed.scheme,
    'host': parsed.hostname,
    'path': parsed.path,
    'query': parsed.query,
    'fragment': parsed.fragment,
    'username': parsed.username,
    'password': parsed.password,
    'port': parsed.port
} for parsed in map(urlparse, urls)]


df = daft.from_pydict({ "parsed_urls": parsed_urls })

I first tried this:

df.select(col('parsed_urls').struct.get("*"))

but wildcarding does not appear to be supported there.

I also tried .explode

df.explode(col('parsed_urls'))

but that seems to work only on list and fixed-size list (FSL) columns.
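
A possible workaround in the meantime, assuming the rows are still plain Python dicts (as parsed_urls is above) and have not yet been loaded into Daft, is to pivot them into one list per key before building the DataFrame. This is only a sketch: flatten_rows is a hypothetical helper, not part of Daft, and the sample URLs are placeholders.

```python
from urllib.parse import urlparse


def flatten_rows(rows):
    """Pivot a list of dicts into a dict of equal-length column lists.

    Hypothetical helper (not a Daft API). Keys missing from a row are
    filled with None so every column has one entry per row.
    """
    # Collect keys in first-seen order so the column order is stable.
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)
    return {key: [row.get(key) for row in rows] for key in keys}


# Two placeholder rows shaped like the parsed_urls entries above.
sample = [
    {"scheme": p.scheme, "host": p.hostname, "path": p.path}
    for p in map(urlparse, ["https://example.com/a", "http://example.org/b"])
]
columns = flatten_rows(sample)
# `columns` could then be passed to daft.from_pydict(columns) instead of
# wrapping the dicts in a single struct column.
```

This sidesteps the struct column rather than unnesting one, so it does not help when the struct arrives already nested (for example, when read from a Parquet file), which is the case the request is really about.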

universalmind303, Sep 26 '24 20:09