pandera
pandera copied to clipboard
Use generic types for config objects.
This is made to be consistent with typeshed best practices:
avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence
Specifically, we have a linter in house that complains about mutable defaults on class members:
class MySchema(pandera.SchemaModel):
a: ...
b: ...
class Config:
unique: List[str] = ["a", "b"] # <- violation
I can't see any reason that unique must be a list rather than a tuple (why would we ever need to mutate this object?). I haven't looked hard enough to determine if I think that there is a "danger" if someone goes and MySchema.Config.unique.append("c") or accidentally mutates this via some oddball path of references at some other place in the code.
If anyone would write such a thing
MySchema.Config.unique.append("c")
they could as easily write
MySchema.Config.unique += ("c",)
Hardly a usecase we should be concerned about. :)
I think that the bigger question is whether we want to prevent someone from using immutable objects in the Config objects (and whether there is a strong reason that we have to have a list in the interface rather than a tuple or some other type of sequence)
Hardly a usecase we should be concerned about. :)
Yes, I mostly agree. IMHO, the primary reason to make this change is to provide more flexibility for an end user. I certainly agree that if someone wants to do something degenerate here, they can and if they are determined, there's probably no good way to stop them.
Having said that, I still think that (in the general case) using immutable type annotations can help users to avoid accidental bugs in circumstances like these.
def non_pure_function(x: MutableSequence[Any]) -> int:
random.shuffle(x)
return 7
n = non_pure_function(MySchema.Config.unique)
...
I sort of doubt that there is a use-case for doing something like this on these Config objects, but it still seems like a good best-practice/habit and -- in our case -- we have a linter that is enforcing this best-practice which is really why I care. I don't want to have to tell my linter to ignore this best practice on every one of these.
1if it doesn't, we should use Collection instead of Sequence ;)
Codecov Report
Patch coverage: 50.00% and no project coverage change.
Comparison is base (
850dcf8) 76.48% compared to head (9e38577) 76.48%.
Additional details and impacted files
@@ Coverage Diff @@
## main #1325 +/- ##
=======================================
Coverage 76.48% 76.48%
=======================================
Files 91 91
Lines 6676 6676
=======================================
Hits 5106 5106
Misses 1570 1570
| Files Changed | Coverage Δ | |
|---|---|---|
| pandera/api/pyspark/container.py | 0.00% <0.00%> (ø) |
|
| pandera/api/pyspark/model_config.py | 0.00% <0.00%> (ø) |
|
| pandera/api/pandas/container.py | 97.77% <100.00%> (ø) |
|
| pandera/api/pandas/model_config.py | 100.00% <100.00%> (ø) |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
thanks for the contribution @mgilson, please see the contributing guide to see how to run pre-commit and local tests