Daft
How to flatten/unnest a struct?
Is your feature request related to a problem? Please describe.
I want to flatten all fields of a struct column into top-level columns, but it seems I have to select every key manually to do that.
Describe the solution you'd like
from urllib.parse import urlparse

import daft
from daft import col

urls = [
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00004-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00005-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ar/train-00006-of-00007.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ary/train-00000-of-00001.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00013-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00014-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00015-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00016-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00017-of-00020.parquet",
    "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.de/train-00018-of-00020.parquet",
]


def parse_url(url):
    # Parse once and reuse the result for every field.
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
        "username": parts.username,
        "password": parts.password,
        "port": parts.port,
    }


# A list of dicts becomes a single struct column named "parsed_urls".
parsed_urls = [parse_url(url) for url in urls]
df = daft.from_pydict({"parsed_urls": parsed_urls})
I first tried this:
df.select(col('parsed_urls').struct.get("*"))
but wildcarding does not appear to be supported there.
I also tried .explode:
df.explode(col('parsed_urls'))
but that seems to work only on list and fixed-size list (fsl) columns.
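In the meantime, one possible workaround is to flatten the records in plain Python before the DataFrame is built, so that each key is already a top-level column. The sketch below is only an illustration: records_to_columns is a hypothetical helper, not part of Daft, and it assumes the records are flat dicts like the ones produced above.

```python
from urllib.parse import urlparse


# Hypothetical helper (not part of Daft): pivot a list of flat records
# into one list per key, so each key can become its own top-level column.
def records_to_columns(records):
    columns = {}
    for i, record in enumerate(records):
        for key, value in record.items():
            # Backfill None for rows seen before this key first appeared.
            columns.setdefault(key, [None] * i).append(value)
        # Pad keys missing from this record so all columns stay aligned.
        for values in columns.values():
            if len(values) <= i:
                values.append(None)
    return columns


url = "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.arc/train-00000-of-00001.parquet"
parts = urlparse(url)
records = [
    {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "port": parts.port,
    }
]
columns = records_to_columns(records)
```

The resulting dict maps each key to a list of values, which is the shape daft.from_pydict already takes, so passing columns instead of { "parsed_urls": parsed_urls } should give one column per key. This does the flattening in Python rather than in the engine, so it only helps when the struct is built client-side, not when it comes from a file.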