Add pandas result builder that converts to long format

Open skrawcz opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe. Hamilton works on "wide" columns -- not "long" ones. However, the "tidy" data ethos holds that data should be in a long format, which does make some things easier to do.

Describe the solution you'd like Add a ResultBuilder variant that takes a specification of how you'd want to collapse the resulting pandas dataframe into long format.

Describe alternatives you've considered People do this manually -- but perhaps doing it in the result builder makes more sense.

Additional context Prerequisites for someone picking this up:

  • know pandas.
  • know Python.
  • can write the pandas code to go from wide to long.
  • can read the Hamilton code base to figure out where to add it.
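
For reference, the wide-to-long step in the prerequisites can be sketched with pandas alone. This is a minimal illustration, not Hamilton code: the dataframe and its column names (`date`, `spend`, `signups`) are made up, and `melt()` is just one of the pandas options.

```python
import pandas as pd

# Hypothetical wide dataframe: one column per computed output.
wide_df = pd.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02"],
        "spend": [10.0, 20.0],
        "signups": [1, 4],
    }
)

# melt() keeps `date` as the identifier and stacks the remaining columns
# into (variable, value) pairs, one row per original cell.
long_df = pd.melt(
    wide_df,
    id_vars=["date"],
    var_name="variable",
    value_name="value",
)
```

The two rows by two value columns of the wide frame become four rows in the long frame, with columns `date`, `variable` and `value`.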

skrawcz avatar Apr 25 '22 21:04 skrawcz

So this doesn't appear to be as simple as I thought it would be.

The issue with going from wide to long is that you need some context to know how to collapse things. To pass that context in, you cannot use a static method, which is what build_result() in the ResultMixin is, since a static method can't reference self.

Here's some possible code -- however, its use is limited to non-distributed/non-cluster computation settings.

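import typing
from typing import Any

import pandas as pd
from hamilton.base import SimplePythonDataFrameGraphAdapter


class SimplePythonLongFormatDataFrameGraphAdapter(SimplePythonDataFrameGraphAdapter):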
    """Adapter for building a long format pandas dataframe from the result.

    There are two pandas methods that could be used:
     - melt() - https://pandas.pydata.org/docs/reference/api/pandas.melt.html#pandas.melt
    or
     - wide_to_long() - https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html

    The user must tell this object which one to use, and provide the correct arguments.
    """
    def __init__(self, method_name: str, **method_kwargs: Any):
        """

        :param method_name: the name of the pandas function to use for going from wide to long format.
            Currently either "melt" or "wide_to_long".
        :param method_kwargs: the arguments, other than the dataframe, to provide for that specific method.
            For information on what arguments to pass in, see:
             - melt() - https://pandas.pydata.org/docs/reference/api/pandas.melt.html#pandas.melt
             - wide_to_long() - https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
        """
        if method_name not in ['melt', 'wide_to_long']:
            raise ValueError(f"Error, unknown {method_name} provided. It should be one of ['melt', 'wide_to_long']")
        self.method_name = method_name
        self.method_kwargs = method_kwargs

    def build_result(self, **outputs: typing.Dict[str, typing.Any]) -> typing.Any:
        """Builds the wide dataframe via the parent class, then converts it to long format."""
        # Zero-argument super() resolves to the parent adapter's build_result().
        wide_df = super().build_result(**outputs)
        pandas_method = getattr(pd, self.method_name)
        long_df = pandas_method(wide_df, **self.method_kwargs)
        del wide_df  # clean this representation up.
        return long_df
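
To make the `method_name` / `method_kwargs` pairing concrete, here is a rough sketch of what a caller might pass for each of the two supported methods. It runs against plain pandas and only mirrors the `getattr(pd, ...)` dispatch used in build_result() above; the dataframe, the column names and the `configs` dict are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide result with "stub + suffix" column names.
wide_df = pd.DataFrame(
    {
        "id": [1, 2],
        "spend2021": [10.0, 30.0],
        "spend2022": [20.0, 40.0],
    }
)

# The (method_name, method_kwargs) pairs a caller might hand to the adapter.
configs = {
    "melt": {"id_vars": ["id"], "var_name": "metric", "value_name": "value"},
    "wide_to_long": {"stubnames": "spend", "i": "id", "j": "year"},
}

results = {}
for method_name, method_kwargs in configs.items():
    # Same dispatch as build_result(): look the function up on the pandas module.
    pandas_method = getattr(pd, method_name)
    results[method_name] = pandas_method(wide_df, **method_kwargs)
```

Note that the two methods do not return the same shape: melt() gives a flat frame with `id`, `metric` and `value` columns, while wide_to_long() gives a frame indexed by (`id`, `year`) with a single `spend` column, so downstream code would need to know which one was chosen.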

skrawcz avatar Apr 30 '22 05:04 skrawcz

@skrawcz I'm not sure I like the abstraction above. It's way too coupled to pandas specifics/APIs. Rather, we should come up with a pretty simple API (or multiple) that expresses what, exactly, we want. melt has a massive amount of complex code; I'm pretty sure wide_to_long just calls it and is more user-friendly. And we should be able to use similar parameters...
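
As one possible shape for such an API, here is a sketch only. The function name `to_long` and all of its parameter names are hypothetical and not part of Hamilton; it simply hides the pandas function choice behind a small set of intent-level arguments.

```python
from typing import List, Optional

import pandas as pd


def to_long(
    wide_df: pd.DataFrame,
    id_columns: List[str],
    value_columns: Optional[List[str]] = None,
    variable_name: str = "variable",
    value_name: str = "value",
) -> pd.DataFrame:
    """Hypothetical helper: collapse value columns into (variable, value) rows.

    `id_columns` are kept as identifiers; `value_columns` defaults to every
    other column. Internally this sketch delegates to DataFrame.melt().
    """
    return wide_df.melt(
        id_vars=id_columns,
        value_vars=value_columns,
        var_name=variable_name,
        value_name=value_name,
    )


# Usage with a made-up wide frame.
example_df = pd.DataFrame({"date": ["a", "b"], "x": [1, 2], "y": [3, 4]})
example_long = to_long(example_df, id_columns=["date"])
```

A result builder could then take these few arguments in its constructor instead of a pandas function name plus arbitrary keyword arguments.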

elijahbenizzy avatar Oct 29 '22 17:10 elijahbenizzy

We are moving repositories! Please see the new version of this issue at https://github.com/DAGWorks-Inc/hamilton/issues/26. Also, please give us a star/update any of your internal links.

elijahbenizzy avatar Feb 26 '23 17:02 elijahbenizzy