`check_output(schema=schema…)` validation can silently modify outputs from upstream nodes
It appears that applying the `check_output` decorator to a function applies `inplace=True` modifications to the result of that function. If the function merely passes through its inputs, this can actually mean the inputs themselves are modified in place.
If those inputs are the outputs of an upstream node then the check_output decorator on the downstream node is really modifying the outputs of the upstream node. That can lead to confusing results where, for example, the outputs of that upstream node are passed to two downstream functions, because even though the inputs to those functions appear to be the same in code and on a visualisation, they may be different at runtime.
This issue emerged for me where I attempted to use different pandera schemas with strict="filter" to split off, validate and process parts of the data separately before rejoining them. When those arms of the graph — effectively subgraphs — were run independently the results were as expected. But when the nodes were joined the outputs of one arm/subgraph were suddenly different because one validation step necessarily had to run before the other and in doing so modified the inputs to the other in a way that wasn't obvious. Thanks to @elijahbenizzy for helping to figure this out.
It appears from this line and those that follow (again, thanks @elijahbenizzy for identifying it) that the result of calling `schema.validate` is obtained but not returned, because the input is modified in place instead.
https://github.com/DAGWorks-Inc/hamilton/blob/e2f25b65079407da048c76989a08d5217ca16a1f/hamilton/data_quality/pandera_validators.py#L27
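To make the failure mode concrete, here is a minimal pandas-only sketch (not Hamilton's actual code; `validate_inplace` and `validate_copy` are hypothetical stand-ins for `schema.validate` with `inplace=True` vs `inplace=False`) showing how an in-place validate call leaks mutation back to the caller while a copying one does not:

```python
import pandas as pd

def validate_inplace(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for schema.validate(df, inplace=True): coerce a column in place
    df["x"] = df["x"].astype(float)
    return df

def validate_copy(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for schema.validate(df, inplace=False): work on a copy
    out = df.copy()
    out["x"] = out["x"].astype(float)
    return out

upstream = pd.DataFrame({"x": [1, 2, 3]})  # int64 column, as if from an upstream node

validate_copy(upstream)
assert str(upstream["x"].dtype) == "int64"    # caller's frame untouched

validate_inplace(upstream)
assert str(upstream["x"].dtype) == "float64"  # caller's frame silently changed
```

If the "upstream" frame is shared by two downstream nodes, the second one now sees the coerced version even though nothing in its own code touched it.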
I found a temporary fix by changing occurrences of the following:
```python
@check_output(schema=schema, …)
def my_func(my_df: pd.DataFrame) -> pd.DataFrame:
    return my_df
```
to instead read something like:
```python
@check_output(schema=schema, …)
def my_func(my_df: pd.DataFrame) -> pd.DataFrame:
    return my_df.copy()
```
such that the in-place modified function result was not the same as the input my_df argument, which is the output of the preceding node.
Imho it would be better if the check_output decorator called schema.validate with inplace=False and modified the decorated function to return the result of that schema.validate call, rather than relying on the decorated function to explicitly ensure its inputs and outputs are not the same.
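A hedged sketch of what that could look like (this is illustrative only, not Hamilton's implementation; `check_output_sketch` and the lambda validator are made up, standing in for `schema.validate(result, inplace=False)`):

```python
import functools
import pandas as pd

def check_output_sketch(validate):
    """Hypothetical decorator: validate without mutating, and return the
    validated (possibly coerced/filtered) result instead of the original."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            # stand-in for schema.validate(result, inplace=False)
            return validate(result)
        return wrapper
    return decorator

@check_output_sketch(lambda df: df.assign(x=df["x"].astype(float)))
def passthrough(df: pd.DataFrame) -> pd.DataFrame:
    return df

inp = pd.DataFrame({"x": [1, 2]})
out = passthrough(inp)
assert str(inp["x"].dtype) == "int64"    # input untouched
assert str(out["x"].dtype) == "float64"  # coercion visible only in the output
```

With this shape, pass-through functions are safe by construction: the decorated function's output is always the validator's return value, never an aliased input.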
If however the current approach is preferred, I suggest some details be added to the documentation to make clearer that in-place modification will occur.
Thanks!
🤔 I don't recall why we put inplace=True (sigh, commit message doesn't mention 🤦 ). Might be an easy fix as you suggest.
Yep, agreed, I think @amosaikman you have the right fix. Just want to think for a minute to make sure its backwards compatible, but I think it should be for the supported cases.
@Amos can you please share an example Pandera schema that has some coercion defined in it? That way we can make sure we're testing the right thing.
Sure @skrawcz, let's take a simplified concrete example. We have some chemical data:
| ProjectCode | LocationID | SampleID | From | To | P_pct | Li_ppm |
|---|---|---|---|---|---|---|
| Test | D1 | SAMP001 | 0.0 | 1.0 | 0.001 | 0.74 |
| Test | D1 | SAMP002 | 1.0 | 2.0 | 0.001 | 0.74 |
| Test | D1 | SAMP003 | 2.0 | 3.0 | 0.001 | 0.92 |
| Test | D1 | SAMP004 | 3.0 | 4.0 | 0.001 | 1.09 |
| Test | D1 | SAMP005 | 4.0 | 5.0 | 0.001 | 1.18 |
First, let's ensure our index and column labels are valid:
```python
pa.DataFrameSchema(
    index=pa.Index(
        str,
        nullable=False,
        unique=True,
        coerce=True,
    ),
    strict=False,
    unique_column_names=True,
)
```
then we'll split off the elemental concentrations:
```python
pa.DataFrameSchema(
    {
        "^[A-Z][a-z]*_ppm$": pa.Column(
            float,
            nullable=True,
            coerce=True,
            checks=pa.Check.in_range(0, 1e6),
            required=False,
            regex=True,
            title="Parts per million",
            description="Element concentration in PPM",
        ),
        ...
    },
    checks=[
        pa.Check(
            lambda df: df.columns.str.split("_").str[0].isin(VALID_SYMBOLS).all(),
            element_wise=False,
            error="Invalid element symbol in column name",
        ),
    ],
    strict="filter",
)
```
and take the other parts separately:
```python
pa.DataFrameSchema(
    {
        "LocationID": pa.Column(
            str,
            coerce=True,
            checks=pa.Check.str_length(min_value=1),
            nullable=False,
            required=True,
        ),
        str(from_): pa.Column(
            float,
            coerce=True,
            checks=pa.Check.ge(0),
            nullable=False,
            required=True,
        ),
        ...
    },
    strict="filter",
    checks=pa.Check(
        lambda df: df["From"] < df["To"],
    ),
)
```
This approach is convenient because it lets us pick groups of chemical columns using Pandera's regex support, something I don't think Hamilton currently handles independently. The trouble is that if we pass the output of the node checked with the first schema to nodes checked with the next two schemas, and those checks modify the upstream output in place, whichever one runs first will remove the part of the df the next one is meant to find, so the next one returns an empty df. That is because these schemas have `strict="filter"` (cf. `strict=True`/`False`), so they are effectively doing extract and validate at once.
Awesome thanks for the detailed response.
Just to ask the question, why not explicitly put pandera as part of the function, or a downstream one, rather than using it via @check_output if you want to modify the dataframe? My concern is that it wouldn't be obvious to me that @check_output would be modifying the dataframe.
That's a fair argument. I guess it goes to whether or not you agree with Pandera's approach of validating and coercing or validating and filtering as a single step. I don't think it changes the argument that if Hamilton permits Pandera coercion and filtering via check_output, it should either not mutate upstream outputs or make clearer that's a risk — imho.
Yup. My inclination is that @check_output shouldn't mutate output (it's also simpler to implement). I guess we could have a flag to enable it (so it would be clear) if people wanted that behavior.
Otherwise looking at the code to enable your preference (of check_output returning a new dataframe), is a little bit of work it looks like :/ hmm @elijahbenizzy thoughts?
Yeah, so I designed it to not allow modification (and I personally don't like the "validator also coerces" strategy, but I get it, and definitely see the value).
I think rewiring wouldn't be too hard -- it's just a matter of changing how it works. I do think we should bury it behind a parameter (`modifies_original` or something like that) that sets it up, and we can also have the data quality checker tell us if it has the capability to modify the data (so we can error out if you ask it to modify when it's just doing validation).
Obviously it's entirely up to you guys what you do, but from a user perspective the following makes sense to me:
- a change that precludes silent upstream mutation by default
- an option to pass through parameters like `inplace=` and `lazy=` from the decorator to the validator
- some additions to the documentation about using `pandera` with `hamilton`, especially around best practices, trade-offs and different approaches.
I generally agree that where `strict="filter"` appears in the schema, the operation is really more extraction than validation, and hence it is probably better run in a transform than on a decorator. That also facilitates passing config params to the schema without `@resolve`. But I'm sure there's an edge case I haven't thought of where a filtering schema applied in a decorator is still the best approach.
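For what it's worth, the "extraction as an explicit transform" shape might look something like this hedged sketch. `chemistry` is a hypothetical node function, and the real body would call `chem_schema.validate(samples, inplace=False)` on a pandera schema; here the filter is faked with a plain column selection so the example is self-contained:

```python
import pandas as pd

def chemistry(samples: pd.DataFrame) -> pd.DataFrame:
    """Extract the elemental-concentration columns as an explicit node,
    rather than via @check_output on a pass-through function."""
    # in real code: return chem_schema.validate(samples, inplace=False)
    return samples[[c for c in samples.columns if c.endswith("_ppm")]].copy()

samples = pd.DataFrame({"SampleID": ["SAMP001"], "Li_ppm": [0.74]})
chem = chemistry(samples)
assert list(chem.columns) == ["Li_ppm"]
assert "SampleID" in samples.columns  # upstream frame untouched
```

Because the function owns the filtering explicitly, there's no hidden mutation: the upstream frame stays intact and the filtered copy is the node's declared output.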
Yes, I think that's a pretty good approach. We should definitely change the default, but that'll be tied to a mode that allows mutation or not. It might even be something like `check_output.mutating_schema` that knows it's allowed to mutate, and we can rewire the DAG accordingly. I highly doubt `in_place` ever makes sense, but it might be nice to expose for the sake of memory optimization. TBD on how exactly we expose that, but I think we can pass a parameter in to make that happen, or arbitrary params to the validator.
Just double-checking -- you're not blocked on this now, right?
No, not blocked. It was briefly infuriating when I couldn't figure out what was happening, but it actually turned out well because it prompted me to rewrite some stuff that needed improving anyway. Thanks for checking.
Closing for now.