[FEATURE-REQUEST] Create pyarrow structs via vaex

Open Ben-Epstein opened this issue 2 years ago • 9 comments

Description

Since vaex provides all these great struct operations, it would be great if we could create structs directly in vaex from massive dataframes.

Additional context

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = pa.StructArray.from_arrays(
    arrays=[df.x.values, df.y.values, df.z.values], names=["x", "y", "z"]
)
df

Now we can use structs, but we brought everything into memory.

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = pa.StructArray.from_arrays(
    arrays=[df.x, df.y, df.z], names=["x", "y", "z"]
)
df

Passing the expressions directly like this would be great, but it fails.

Even better would be a helper function, something like

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = df.func.create_arrow_struct(df.x, df.y, df.z)
df

or something similar

Ben-Epstein avatar Apr 26 '22 17:04 Ben-Epstein

What do you think of this:

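import pyarrow as pa
import vaex

@vaex.register_function()
def create_arrow_struct(**kwargs):
    # Keyword names become the struct field names, the passed arrays become the fields
    return pa.StructArray.from_arrays(list(kwargs.values()), names=list(kwargs.keys()))

df = vaex.datasets.titanic()
df.func.create_arrow_struct(name=df['name'], age=df['age'])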

maartenbreddels avatar Apr 26 '22 17:04 maartenbreddels

That's great!

But @maartenbreddels it doesn't work if you try to listAgg that struct column. Maybe that's a new issue, not sure.

Ben-Epstein avatar Apr 29 '22 13:04 Ben-Epstein

Yeah, we can only do that on primitives and strings. Maybe we can split the struct, and merge it back again automatically.

maartenbreddels avatar Apr 29 '22 17:04 maartenbreddels

@JovanVeljanoski any opinions on this? How should we attach this, or do you like my code proposal?

maartenbreddels avatar May 13 '22 11:05 maartenbreddels

This is the opposite of https://github.com/vaexio/vaex/pull/2072 so once we merge that we should take another look at this.

maartenbreddels avatar Jun 08 '22 09:06 maartenbreddels

@JovanVeljanoski any opinions on this? How should we attach this, or do you like my code proposal?

Still thinking about it. I want to do some tests, but I'm busy... :S

JovanVeljanoski avatar Jun 08 '22 09:06 JovanVeljanoski

I think this would be nice

df = vaex.from_scalars(user_name="Maarten", user_surname="Breddels")
df = df.struct.merge(join_char="_") # this will automatically collect all user_* into a column name user

and

df = vaex.datasets.titanic()
df = df.struct.merge({'person': ['name', 'age']})  # will create a person struct column based on name and age
or..
df = df.struct.merge({'Person': {'name': 'Name', 'age': 'Age'}})  # use a dict to rename?
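
The automatic collection of user_* columns could work roughly as sketched below. This is plain Python over column names only. The helper name group_by_prefix is hypothetical, and the rule that a prefix must be shared by at least two columns is an assumption, not part of the proposal.

```python
from collections import defaultdict


def group_by_prefix(column_names, join_char="_"):
    """Map each shared prefix to the (column, field) pairs that belong to it."""
    groups = defaultdict(list)
    for column in column_names:
        # Split on the first join_char: "user_name" -> ("user", "_", "name").
        prefix, sep, field = column.partition(join_char)
        if sep and prefix and field:
            groups[prefix].append((column, field))
    # Assumption: a struct is only worth creating when several columns share the prefix.
    return {prefix: pairs for prefix, pairs in groups.items() if len(pairs) > 1}


groups = group_by_prefix(["user_name", "user_surname", "age"])
```

Here groups maps "user" to the columns user_name and user_surname with field names name and surname, while age is left alone because it has no prefix.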

maartenbreddels avatar Jun 08 '22 10:06 maartenbreddels

I like the proposal of @maartenbreddels above. The one correction/suggestion I would make is this

df['person'] = df.struct.merge(['name', 'age'])

df['person'] = df.struct.merge({'name':'Name', 'age':'Age'})

Although I have to say I do not know if merge is the right method name here. I would naively expect that most methods in the struct namespace operate on structs rather than create them, so something like struct.create or struct.from_expressions might be more explicit?

JovanVeljanoski avatar Aug 08 '22 20:08 JovanVeljanoski

Yes, since you can imagine `df.struct` doing a type check, it also feels odd to me. But this does keep all the methods organized in one place.

Can you start by writing a test? We can do a last-minute name change anyway.

maartenbreddels avatar Aug 30 '22 07:08 maartenbreddels