`check_output(schema=schema…)` validation can silently modify outputs from upstream nodes
It appears that applying the `check_output` decorator to a function applies `inplace=True` modifications to the result of that function. If the function merely passes through its inputs, this can actually mean the inputs themselves are modified in place.
If those inputs are the outputs of an upstream node then the check_output decorator on the downstream node is really modifying the outputs of the upstream node. That can lead to confusing results where, for example, the outputs of that upstream node are passed to two downstream functions, because even though the inputs to those functions appear to be the same in code and on a visualisation, they may be different at runtime.
This issue emerged for me where I attempted to use different pandera schemas with strict="filter" to split off, validate and process parts of the data separately before rejoining them. When those arms of the graph — effectively subgraphs — were run independently the results were as expected. But when the nodes were joined the outputs of one arm/subgraph were suddenly different because one validation step necessarily had to run before the other and in doing so modified the inputs to the other in a way that wasn't obvious. Thanks to @elijahbenizzy for helping to figure this out.
It appears from this line and those that follow (again, thanks @elijahbenizzy for identifying it) that the result of calling `schema.validate` is obtained but not returned, because the input is modified in place instead.
https://github.com/DAGWorks-Inc/hamilton/blob/e2f25b65079407da048c76989a08d5217ca16a1f/hamilton/data_quality/pandera_validators.py#L27
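To make the failure mode concrete, here is a minimal pandas-only sketch (not Hamilton's actual code; `validate_inplace` and `validate_copy` are hypothetical stand-ins for `schema.validate` with `inplace=True` vs `inplace=False`) showing how an in-place validate call leaks mutation back to the caller while a copying one does not:

```python
import pandas as pd

def validate_inplace(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for schema.validate(df, inplace=True): coerce a column in place
    df["x"] = df["x"].astype(float)
    return df

def validate_copy(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for schema.validate(df, inplace=False): work on a copy
    out = df.copy()
    out["x"] = out["x"].astype(float)
    return out

upstream = pd.DataFrame({"x": [1, 2, 3]})  # int64 column, as if from an upstream node

validate_copy(upstream)
assert str(upstream["x"].dtype) == "int64"    # caller's frame untouched

validate_inplace(upstream)
assert str(upstream["x"].dtype) == "float64"  # caller's frame silently changed
```

If the "upstream" frame is shared by two downstream nodes, the second one now sees the coerced version even though nothing in its own code touched it.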
I found a temporary fix by changing occurrences of the following:
```python
@check_output(schema=schema, …)
def my_func(my_df: pd.DataFrame) -> pd.DataFrame:
    return my_df
```
to instead read something like:
```python
@check_output(schema=schema, …)
def my_func(my_df: pd.DataFrame) -> pd.DataFrame:
    return my_df.copy()
```
such that the in-place modified function result was not the same as the input my_df argument, which is the output of the preceding node.
Imho it would be better if the check_output decorator called schema.validate with inplace=False and modified the decorated function to return the result of that schema.validate call, rather than relying on the decorated function to explicitly ensure its inputs and outputs are not the same.
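A hedged sketch of what that could look like (this is illustrative only, not Hamilton's implementation; `check_output_sketch` and the lambda validator are made up, standing in for `schema.validate(result, inplace=False)`):

```python
import functools
import pandas as pd

def check_output_sketch(validate):
    """Hypothetical decorator: validate without mutating, and return the
    validated (possibly coerced/filtered) result instead of the original."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            # stand-in for schema.validate(result, inplace=False)
            return validate(result)
        return wrapper
    return decorator

@check_output_sketch(lambda df: df.assign(x=df["x"].astype(float)))
def passthrough(df: pd.DataFrame) -> pd.DataFrame:
    return df

inp = pd.DataFrame({"x": [1, 2]})
out = passthrough(inp)
assert str(inp["x"].dtype) == "int64"    # input untouched
assert str(out["x"].dtype) == "float64"  # coercion visible only in the output
```

With this shape, pass-through functions are safe by construction: the decorated function's output is always the validator's return value, never an aliased input.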
If however the current approach is preferred, I suggest some details be added to the documentation to make clearer that in-place modification will occur.
Thanks!
🤔 I don't recall why we put inplace=True (sigh, commit message doesn't mention 🤦 ). Might be an easy fix as you suggest.
Yep, agreed, I think @amosaikman you have the right fix. Just want to think for a minute to make sure its backwards compatible, but I think it should be for the supported cases.
@Amos can you please share an example Pandera schema that has some coercion defined in it? That way we can make sure we're testing the right thing.
Sure @skrawcz, let's take a simplified concrete example. We have some chemical data:
| ProjectCode | LocationID | SampleID | From | To | P_pct | Li_ppm |
|---|---|---|---|---|---|---|
| Test | D1 | SAMP001 | 0.0 | 1.0 | 0.001 | 0.74 |
| Test | D1 | SAMP002 | 1.0 | 2.0 | 0.001 | 0.74 |
| Test | D1 | SAMP003 | 2.0 | 3.0 | 0.001 | 0.92 |
| Test | D1 | SAMP004 | 3.0 | 4.0 | 0.001 | 1.09 |
| Test | D1 | SAMP005 | 4.0 | 5.0 | 0.001 | 1.18 |
First, let's ensure our index and column labels are valid:
```python
pa.DataFrameSchema(
    index=pa.Index(
        str,
        nullable=False,
        unique=True,
        coerce=True,
    ),
    strict=False,
    unique_column_names=True,
)
```
then we'll split off the elemental concentrations:
```python
pa.DataFrameSchema(
    {
        "^[A-Z][a-z]*_ppm$": pa.Column(
            float,
            nullable=True,
            coerce=True,
            checks=pa.Check.in_range(0, 1e6),
            required=False,
            regex=True,
            title="Parts per million",
            description="Element concentration in PPM",
        ),
        ...
    },
    checks=[
        pa.Check(
            lambda df: df.columns.str.split("_").str[0].isin(VALID_SYMBOLS).all(),
            element_wise=False,
            error="Invalid element symbol in column name",
        ),
    ],
    strict="filter",
)
```
and take the other parts separately:
```python
pa.DataFrameSchema(
    {
        "LocationID": pa.Column(
            str,
            coerce=True,
            checks=pa.Check.str_length(min_value=1),
            nullable=False,
            required=True,
        ),
        str(from_): pa.Column(
            float,
            coerce=True,
            checks=pa.Check.ge(0),
            nullable=False,
            required=True,
        ),
        ...
    },
    strict="filter",
    checks=pa.Check(
        lambda df: df["From"] < df["To"],
    ),
)
```
This approach is convenient because it lets us pick groups of chemical columns using Pandera's regex support, something I don't think Hamilton currently handles independently. The trouble is that if we pass the output of the node checked with the first schema to nodes checked with the next two schemas, and those checks modify the upstream output in place, whichever one runs first will remove the part of the df the next one is meant to find, so the next one returns an empty df. That is because these schemas have `strict="filter"` (cf. `strict=True`/`False`), so they are effectively doing extract and validate at once.
Awesome thanks for the detailed response.
Just to ask the question, why not explicitly put pandera as part of the function, or a downstream one, rather than using it via @check_output if you want to modify the dataframe? My concern is that it wouldn't be obvious to me that @check_output would be modifying the dataframe.
That's a fair argument. I guess it goes to whether or not you agree with Pandera's approach of validating and coercing or validating and filtering as a single step. I don't think it changes the argument that if Hamilton permits Pandera coercion and filtering via check_output, it should either not mutate upstream outputs or make clearer that's a risk — imho.
Yup. My inclination is that @check_output shouldn't mutate output (it's also simpler to implement). I guess we could have a flag to enable it (so it would be clear) if people wanted that behavior.
Otherwise looking at the code to enable your preference (of check_output returning a new dataframe), is a little bit of work it looks like :/ hmm @elijahbenizzy thoughts?
Yeah, so I designed it to not allow modification (and I personally don't like the "validator also coerces" strategy, but I get it, and definitely see the value).
I think rewiring wouldn't be too hard -- it's just a matter of changing how it works. I do think we should bury it behind a parameter (`modifies_original` or something like that) that sets it up, and we can also have the data quality checker tell us if it has the capability to modify the data (so we can error out if you ask it to modify when it's just doing validation).
Obviously it's entirely up to you guys what you do, but from a user perspective the following makes sense to me:
- a change that precludes silent upstream mutation by default
- an option to pass through parameters like `inplace=` and `lazy=` from the decorator to the validator
- some additions to the documentation about using `pandera` with `hamilton`, especially around best practices, trade-offs and different approaches.
I generally agree that where `strict="filter"` appears in the schema, the operation is really more extraction than validation, and hence it is probably better run in a transform than on a decorator. That also facilitates passing config params to the schema without `@resolve`. But I'm sure there's an edge case I haven't thought of where a filtering schema applied in a decorator is still the best approach.
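For what it's worth, the "extraction as an explicit transform" shape might look something like this hedged sketch. `chemistry` is a hypothetical node function, and the real body would call `chem_schema.validate(samples, inplace=False)` on a pandera schema; here the filter is faked with a plain column selection so the example is self-contained:

```python
import pandas as pd

def chemistry(samples: pd.DataFrame) -> pd.DataFrame:
    """Extract the elemental-concentration columns as an explicit node,
    rather than via @check_output on a pass-through function."""
    # in real code: return chem_schema.validate(samples, inplace=False)
    return samples[[c for c in samples.columns if c.endswith("_ppm")]].copy()

samples = pd.DataFrame({"SampleID": ["SAMP001"], "Li_ppm": [0.74]})
chem = chemistry(samples)
assert list(chem.columns) == ["Li_ppm"]
assert "SampleID" in samples.columns  # upstream frame untouched
```

Because the function owns the filtering explicitly, there's no hidden mutation: the upstream frame stays intact and the filtered copy is the node's declared output.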
Yes, I think that's a pretty good approach. We should definitely change the default, but that'll be tied to a mode that allows mutation or not. It might even be something like `check_output.mutating_schema` that knows it's allowed to mutate, and we can rewire the DAG accordingly. I highly doubt `in_place` ever makes sense, but it might be nice to expose for the sake of memory optimization. TBD on how exactly we expose that, but I think we can pass a parameter in to make that happen, or arbitrary params to the validator.
Just double-checking -- you're not blocked on this now, right?
No, not blocked. It was briefly infuriating when I couldn't figure out what was happening, but it actually turned out well because it prompted me to rewrite some stuff that needed improving anyway. Thanks for checking.
Closing for now.