etl

Update frame repacking to use new pyarrow data types

Open larsyencken opened this issue 10 months ago • 6 comments

Context

We repack data frames to make them smaller on disk and faster to work with. This has some minor annoyances: for example, it makes more variables categorical, which complicates group-bys.
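One way categorical columns can complicate group-bys is sketched below. This is an illustration only, not necessarily the exact problem the team hit: the column names and values are made up, and the point is simply that a categorical column remembers levels that no longer occur in the data.

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame(
    {
        "country": ["France", "France", "Kenya", "Peru"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
df["country"] = df["country"].astype("category")

# Filtering rows does not drop the unused "Peru" category level.
subset = df[df["country"] != "Peru"]

# observed=False emits a group for every category level, including unused ones.
all_levels = subset.groupby("country", observed=False)["value"].sum()

# observed=True only emits groups that actually appear in the data.
only_seen = subset.groupby("country", observed=True)["value"].sum()

print(all_levels)
print(only_seen)
```

Passing `observed=True` explicitly, or calling `cat.remove_unused_categories()` after filtering, avoids the phantom groups.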

What

With Pandas 2.2, we have the option to change our frame repacking to use new pyarrow data types, which are supposed to be much more efficient.

That would also bring our data catalog into compatibility with more of the data ecosystem (e.g. Polars, Nushell and friends).

larsyencken, Mar 28 '24 10:03

Blocked on:

  • #1094

larsyencken, Mar 28 '24 10:03

From discussion: we could also consider a gradual migration, e.g. enabling pyarrow types for new datasets rather than applying it over everything.

We could also benchmark the performance gains to see if they're worth it.

larsyencken, Apr 11 '24 09:04

Pablo noted that adding commitments for our future selves can cause "small" data updates to grow in scope. We should try to make as much as possible automatic.

larsyencken, Apr 11 '24 09:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Jun 10 '24 23:06

Don't close it. We should at least compare the performance (CPU and memory) of the current repacking against the new pyarrow dtypes.

Marigold, Jun 11 '24 05:06

(While looking into this and trying to optimize reading the table, I discovered that loading is actually really fast (<1s), but setting the primary key as the index is slow.)
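To check where the time goes, the index step can be timed in isolation, as in the sketch below. The synthetic table and the `country`/`year` primary key are assumptions for illustration; the actual tables and keys in the catalog will differ, and so will the timings.

```python
import time

import numpy as np
import pandas as pd

# Synthetic table standing in for a loaded catalog table.
rng = np.random.default_rng(0)
n = 500_000
table = pd.DataFrame(
    {
        "country": rng.choice(["France", "Kenya", "Peru", "Japan"], size=n),
        "year": rng.integers(1950, 2024, size=n),
        "value": rng.random(n),
    }
)

# Hypothetical primary key; real tables define their own.
primary_key = ["country", "year"]

# Time only the step that turns the key columns into a (Multi)Index.
start = time.perf_counter()
indexed = table.set_index(primary_key)
set_index_seconds = time.perf_counter() - start

print(f"set_index on {n:,} rows took {set_index_seconds:.3f}s")
```

Timing the load and the `set_index` call separately on a real table would confirm whether the index step dominates.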

Marigold, Jul 09 '24 09:07