pandera Use generic types for config objects.

This is made to be consistent with typeshed best practices:

avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence

Specifically, we have a linter in house that complains about mutable defaults on class members:

class MySchema(pandera.SchemaModel):
    a: ...
    b: ...

    class Config:
        unique: List[str] = ["a", "b"]  # <- violation

I can't see any reason that unique must be a list rather than a tuple (why would we ever need to mutate this object?). I haven't looked hard enough to determine if I think that there is a "danger" if someone goes and MySchema.Config.unique.append("c") or accidentally mutates this via some oddball path of references at some other place in the code.

Aug 30 '23 15:08 mgilson

If anyone would write such a thing

MySchema.Config.unique.append("c")

they could as easily write

MySchema.Config.unique += ("c",)

Hardly a usecase we should be concerned about. :)

Sep 01 '23 00:09 joaoe

I think that the bigger question is whether we want to prevent someone from using immutable objects in the Config objects (and whether there is a strong reason that we have to have a list in the interface rather than a tuple or some other type of sequence)

Hardly a usecase we should be concerned about. :)

Yes, I mostly agree. IMHO, the primary reason to make this change is to provide more flexibility for an end user. I certainly agree that if someone wants to do something degenerate here, they can and if they are determined, there's probably no good way to stop them.

Having said that, I still think that (in the general case) using immutable type annotations can help users to avoid accidental bugs in circumstances like these.

def non_pure_function(x: MutableSequence[Any]) -> int:
    random.shuffle(x)
    return 7

n = non_pure_function(MySchema.Config.unique)
...

I sort of doubt that there is a use-case for doing something like this on these Config objects, but it still seems like a good best-practice/habit and -- in our case -- we have a linter that is enforcing this best-practice which is really why I care. I don't want to have to tell my linter to ignore this best practice on every one of these.

^{¹if it doesn't, we should use Collection instead of Sequence ;)}

Sep 01 '23 01:09 mgilson

Codecov Report

Patch coverage: 50.00% and no project coverage change.

Comparison is base (850dcf8) 76.48% compared to head (9e38577) 76.48%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1325   +/-   ##
=======================================
  Coverage   76.48%   76.48%           
=======================================
  Files          91       91           
  Lines        6676     6676           
=======================================
  Hits         5106     5106           
  Misses       1570     1570

Files Changed	Coverage Δ
pandera/api/pyspark/container.py	`0.00% <0.00%> (ø)`
pandera/api/pyspark/model_config.py	`0.00% <0.00%> (ø)`
pandera/api/pandas/container.py	`97.77% <100.00%> (ø)`
pandera/api/pandas/model_config.py	`100.00% <100.00%> (ø)`

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Sep 13 '23 14:09 codecov[bot]

thanks for the contribution @mgilson, please see the contributing guide to see how to run pre-commit and local tests

Apr 15 '24 02:04 cosmicBboy

pandera pandera copied to clipboard

Use generic types for config objects.

Codecov Report

pandera
pandera copied to clipboard