
ENH: `DataFrame.struct.explode(column, *, separator=".")` method to pull struct subfields into the parent DataFrame

Open tswast opened this issue 1 year ago • 2 comments

Feature Type

  • [X] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Currently, I can use Series.struct.explode() to create a DataFrame out of the subfields of an ArrowDtype(pa.struct(...)) column. Joining these back to the original DataFrame is a little awkward. It'd be nice to have a top-level explode() similar to how DataFrame.explode() works on lists.
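For reference, here's roughly what the workaround looks like today (a minimal sketch; the column names and struct fields are just for illustration):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "file": pd.Series(
            [
                {"version": 1, "project": "pandas"},
                {"version": 2, "project": "numpy"},
            ],
            dtype=pd.ArrowDtype(
                pa.struct([("version", pa.int64()), ("project", pa.string())])
            ),
        ),
        "download_count": [100, 200],
    }
)

# Series.struct.explode() turns the sub-fields into their own DataFrame...
subfields = df["file"].struct.explode()
# ...which then has to be joined back and the struct column dropped by hand.
df = pd.concat([df.drop(columns="file"), subfields], axis=1)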

Feature Description

Add a new StructFrameAccessor to pandas/core/arrays/arrow/accessors.py. I think the implementation could be almost identical to what I did here: https://github.com/googleapis/python-bigquery-dataframes/pull/916

from __future__ import annotations

from collections.abc import Hashable, Sequence
from typing import TYPE_CHECKING, Union, cast

from pandas.api.types import is_list_like

if TYPE_CHECKING:
    from pandas import DataFrame


class StructFrameAccessor:
    """
    Accessor object for structured data properties of the DataFrame values.
    """

    def __init__(self, data: DataFrame) -> None:
        self._parent = data

    def explode(self, column, *, separator: str = "."):
        """
        Extract all child fields of one or more struct columns and add them to the DataFrame.

        **Examples:**

            >>> import pyarrow as pa
            >>> countries = pd.Series(["cn", "es", "us"])
            >>> files = pd.Series(
            ...     [
            ...         {"version": 1, "project": "pandas"},
            ...         {"version": 2, "project": "pandas"},
            ...         {"version": 1, "project": "numpy"},
            ...     ],
            ...     dtype=pd.ArrowDtype(pa.struct(
            ...         [("version", pa.int64()), ("project", pa.string())]
            ...     ))
            ... )
            >>> downloads = pd.Series([100, 200, 300])
            >>> df = pd.DataFrame(
            ...     {"country": countries, "file": files, "download_count": downloads}
            ... )
            >>> df.struct.explode("file")
              country  file.version file.project  download_count
            0      cn             1       pandas             100
            1      es             2       pandas             200
            2      us             1        numpy             300

        Args:
            column:
                Label of a struct column to explode, or a non-empty sequence
                of unique labels of struct columns to explode.
            separator:
                Separator/delimiter to place between the original column name
                and the sub-field name in the new column labels.

        Returns:
            DataFrame:
                A new DataFrame with each requested struct column replaced by
                one column per sub-field.
        """
        df = self._parent
        column_labels = check_column(column)

        for label in column_labels:
            # Remember where the struct column sits, drop it, and re-insert
            # one new column per sub-field at that position (iterating in
            # reverse keeps the sub-fields in their original order).
            position = df.columns.to_list().index(label)
            df = df.drop(columns=label)
            subfields = self._parent[label].struct.explode()
            for subfield in reversed(subfields.columns):
                df.insert(
                    position, f"{label}{separator}{subfield}", subfields[subfield]
                )

        return df


def check_column(
    column: Union[Hashable, Sequence[Hashable]],
) -> Sequence[Hashable]:
    # ``Hashable`` stands in for the bigframes-specific ``blocks.Label`` alias
    # used in the original implementation.
    if not is_list_like(column):
        column_labels = cast(Sequence[Hashable], (column,))
    else:
        column_labels = cast(Sequence[Hashable], tuple(column))

    if not column_labels:
        raise ValueError("column must be nonempty")
    if len(column_labels) > len(set(column_labels)):
        raise ValueError("column must be unique")

    return column_labels
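
If it helps evaluate the idea, a rough version can also be prototyped today outside of pandas internals with the public accessor-registration API. This is only a sketch that assumes a single struct column; the accessor name "struct" is a placeholder and would clash with any future built-in accessor of the same name:

import pandas as pd


@pd.api.extensions.register_dataframe_accessor("struct")
class StructFrameAccessor:
    def __init__(self, data: pd.DataFrame) -> None:
        self._parent = data

    def explode(self, column, *, separator: str = "."):
        df = self._parent
        # Build the sub-field columns with "<column><separator><field>" names.
        subfields = df[column].struct.explode().add_prefix(f"{column}{separator}")
        # Drop the struct column and re-insert the sub-fields at its position.
        position = df.columns.get_loc(column)
        df = df.drop(columns=column)
        for name in reversed(subfields.columns):
            df.insert(position, name, subfields[name])
        return df

With that registered, the df.struct.explode("file") call from the docstring example above works as shown.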

Alternative Solutions

An alternative could be to modify DataFrame.explode to support exploding a struct into columns, potentially via an axis parameter that selects columns instead of rows.
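
Purely as an illustration of that alternative (DataFrame.explode has no axis parameter today, so this call is hypothetical):

# Hypothetical API: explode the struct column "file" into sibling columns
# instead of exploding list elements into extra rows.
exploded = df.explode("file", axis=1)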

Additional Context

See also the Series.struct accessor added last year: https://github.com/pandas-dev/pandas/pull/54977

tswast commented Aug 23 '24 16:08

Thanks for the request. From an API design perspective, having a dtype-specific accessor on DataFrame seems like a bad approach.

For the feature itself, it seems to me you can use Series.struct.explode along with pd.concat([df, ...], axis=1). Does this not work in your use case?

rhshadrach commented Aug 25 '24 13:08

From an API design perspective, having a dtype-specific accessor on DataFrame seems like a bad approach.

Gotcha. I did see we have some already, e.g. SparseFrameAccessor, which does seem SparseDtype-specific.

For the feature itself, it seems to me you can use Series.struct.explode along with pd.concat([df, ...], axis=1). Does this not work in your use case?

Some nice things about this feature not covered by pd.concat([df, ...], axis=1):

  1. The columns appear in the same order as in the original DataFrame, with the sub-field columns replacing the original struct column.
  2. The column names include the original column name (joined by separator="."), whereas Series.struct.explode() only returns the sub-field names as the column names. Reproducing both by hand takes a few extra steps, as sketched below.
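
For concreteness, a minimal sketch of what the concat-based workaround needs to match both points, reusing the df from the docstring example above:

# Prefix the sub-field columns with the original column name and separator.
subfields = df["file"].struct.explode().add_prefix("file.")
# Splice them back in at the struct column's original position so the
# overall column order is preserved.
position = df.columns.get_loc("file")
rest = df.drop(columns="file")
exploded = pd.concat(
    [rest.iloc[:, :position], subfields, rest.iloc[:, position:]], axis=1
)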

tswast commented Aug 26 '24 17:08