vaex
[FEATURE-REQUEST] Create pyarrow structs via vaex
Description: Since vaex already provides all these great struct operations, it would be great if we could create structs directly in vaex, even for massive dataframes.
Additional context
import pyarrow as pa
import vaex
df = vaex.example()
df["xyz"] = pa.StructArray.from_arrays(
arrays=[df.x.values, df.y.values, df.z.values], names=["x", "y", "z"]
)
df
Now we can use structs, but we brought everything into memory, because .values materializes each column.
import pyarrow as pa
import vaex
df = vaex.example()
df["xyz"] = pa.StructArray.from_arrays(
arrays=[df.x, df.y, df.z], names=["x", "y", "z"]
)
df
Passing the expressions directly like that would be great, but it fails.
Even better would be a helper function, something like
import pyarrow as pa
import vaex
df = vaex.example()
df["xyz"] = df.func.create_arrow_struct(df.x, df.y, df.z)
df
or something similar
What do you think of this:
import pyarrow as pa
import vaex
@vaex.register_function()
def create_arrow_struct(**kwargs):
    # keyword names become the struct field names
    return pa.StructArray.from_arrays(list(kwargs.values()), names=list(kwargs.keys()))
df = vaex.datasets.titanic()
df.func.create_arrow_struct(name=df['name'], age=df['age'])
That's great!
But @maartenbreddels it doesn't work if you try to listAgg that struct column. Maybe that's a new issue, not sure.
Yeah, we can only do that on primitives and strings. Maybe we can split the struct, and merge it back again automatically.
@JovanVeljanoski any opinions on this? How should we attach this, or do you like my code proposal?
This is the opposite of https://github.com/vaexio/vaex/pull/2072 so once we merge that we should take another look at this.
Still thinking about it. I want to do some tests, but I'm busy... :S
I think this would be nice
df = vaex.from_scalars(user_name="Maarten", user_surname="Breddels")
df = df.struct.merge(join_char="_")  # this will automatically collect all user_* columns into a column named user
and
df = vaex.datasets.titanic()
df = df.struct.merge({'person': ['name', 'age']})  # will create a person struct column based on name and age
or..
df = df.struct.merge({'Person': {'name': 'Name', 'age': 'Age'}})  # use a dict to rename?
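To make the automatic `join_char` collection a bit more concrete, here is a minimal pure-Python sketch of the grouping step only. `group_by_prefix` is a hypothetical name, nothing like it exists in vaex as far as this thread shows, and it assumes the prefix is everything before the first `join_char`.

```python
from collections import defaultdict


def group_by_prefix(column_names, join_char="_"):
    """Group column names as {prefix: {field_name: column_name}}.

    Hypothetical helper: columns that do not contain join_char are skipped.
    """
    groups = defaultdict(dict)
    for column in column_names:
        # split at the first join_char only
        prefix, sep, field = column.partition(join_char)
        if sep and prefix and field:
            groups[prefix][field] = column
    return dict(groups)


columns = ["user_name", "user_surname", "id"]
groups = group_by_prefix(columns)
```

Splitting at the first separator is a design choice that would need discussion: a column such as `user_home_city` would end up as field `home_city` of `user`, which may or may not be what people expect.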
I like the proposal of @maartenbreddels above. The one correction/suggestion I would make is this
df['person'] = df.struct.merge(['name', 'age'])
df['person'] = df.struct.merge({'name':'Name', 'age':'Age'})
Although I have to say I do not know if `merge` is the right method name here. I would naively expect that most methods in the `struct` namespace operate on structs rather than create them, so something like `struct.create` or `struct.from_expressions` might be more explicit?
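Whatever the method ends up being called, the list form and the dict form suggested above could share one normalization step. A minimal sketch, assuming the dict keys are existing column names and the values are the struct field names (the thread does not settle the direction); `normalize_fields` is a hypothetical name:

```python
def normalize_fields(spec):
    """Return {source_column: struct_field_name} from a list or a dict."""
    if isinstance(spec, str):
        # a single column name, not an iterable of characters
        spec = [spec]
    if isinstance(spec, dict):
        return dict(spec)
    # list form: field names equal the column names
    return {name: name for name in spec}


from_list = normalize_fields(["name", "age"])
from_dict = normalize_fields({"name": "Name", "age": "Age"})
```

A test for the real method could then cover both input forms against the same expected struct fields.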
Yes, since you can imagine `df.struct` doing a type check, it also feels odd to me. But this does organize all the methods.
Can you start by writing a test? We can do a last-minute name change anyway.