pandas-dataclasses
ENH: pyarrow and optionally pydantic
What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?
Pandas 2.0 now supports pyarrow for many things, and pydantic does data validation with a drop-in `dataclasses.dataclass` replacement at `pydantic.dataclasses.dataclass`.

- https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes
  - `pd.read_*(..., dtype_backend="pyarrow")`
- https://pandas.pydata.org/docs/dev/user_guide/pyarrow.html
- https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.convert_dtypes.html#pandas.DataFrame.convert_dtypes
- https://www.google.com/search?q=pyarrow+dataclasses
- https://github.com/freegor/be-pydantic/blob/main/main.py
- https://arrow.apache.org/docs/python/data.html
- https://arrow.apache.org/docs/python/pandas.html
- https://github.com/apache/arrow/blob/main/python/pyarrow/cffi.py
- https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/python_to_arrow.cc
- https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc
- https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
- https://github.com/apache/arrow/blob/main/python/pyarrow/dataset.py
- https://github.com/apache/arrow/blob/97821aa5af650e6478116cae7c0128fe37dad067/python/pyarrow/tests/test_pandas.py#L146 TestConvertMetadata
- https://github.com/apache/arrow/blob/main/python/pyarrow/tests/test_schema.py#L221 test_schema*()
- https://www.google.com/search?q=pydantic+dataclasses
- https://docs.pydantic.dev/usage/dataclasses/
- https://github.com/pydantic/pydantic/blob/main/docs/usage/dataclasses.md

> If you don't want to use pydantic's BaseModel you can instead get the same data validation on standard dataclasses
> **Difference with stdlib dataclasses:** Note that the `dataclasses.dataclass` from Python stdlib implements only the `__post_init__` method since it doesn't run a validation step. When substituting usage of `dataclasses.dataclass` with `pydantic.dataclasses.dataclass`, it is recommended to move the code executed in the `__post_init__` method to the `__post_init_post_parse__` method, and only leave behind part of code which needs to be executed before validation.

https://docs.pydantic.dev/usage/dataclasses/#difference-with-stdlib-dataclasses
- https://github.com/pydantic/pydantic/blob/main/pydantic/dataclasses.py
- https://github.com/pydantic/pydantic/blob/main/pydantic/_internal/_dataclasses.py
- https://github.com/pydantic/pydantic/tree/main/docs/examples/ (`dataclasses*.py`)
- https://github.com/pydantic/pydantic/blob/main/tests/test_dataclasses.py
  - `@pydantic.dataclasses.dataclass`
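The drop-in replacement can be sketched like this (assumes pydantic is installed; `Point` is a made-up example class). Same dataclass syntax, but field values are validated, and coerced in lax mode, on construction:

```python
from pydantic import ValidationError
from pydantic.dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int


# The string "1" is coerced/validated to the int 1:
p = Point(x="1", y=2)

# Values that cannot be coerced raise ValidationError:
try:
    Point(x="not-an-int", y=2)
    raised = False
except ValidationError:
    raised = True
```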
FWIW, re: data validation these days: `pydantic_schemaorg` validates with the schema.org schema (including `QuantitativeValue` and `QuantitativeValueDistribution`); CSVW (CSV on the Web) is a standard for CSV in RDF; RDF has many representations: RDF/XML, Turtle (`.ttl`), JSON-LD (`.json`, `application/ld+json`), and RDFa (RDF-in-HTML-Attributes). Some applications, including search engines, work with at least bibliographic linked data, such as subtypes of https://schema.org/CreativeWork like https://schema.org/ScholarlyArticle, https://schema.org/Dataset, and https://schema.org/DataCatalog.

Other existing standards for data schema and/or validation: SDMX (pandaSDMX), W3C Data Cubes (pandas-datacube), JSON Schema (pydantic, react-jsonschema-form), and W3C SHACL (Schema.org).
https://github.com/lexiq-legal/pydantic_schemaorg generates templated pydantic `.py` source files containing validators for all of the `rdfs:Class` and `rdfs:Property` definitions in a release of the https://schema.org/ meta-vocabulary:

- https://github.com/schemaorg/schemaorg/tree/v15.0-release/data/releases/15.0
- https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-all-https.ttl
- https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-shapes.shacl
- https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-subclasses.shacl
For example:
- https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/Quantity.py https://schema.org/Quantity
- https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/QuantitativeValue.py https://schema.org/QuantitativeValue
- https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/QuantitativeValueDistribution.py https://schema.org/QuantitativeValueDistribution
W3C SHACL:

- pydantic does not (yet?) do W3C SHACL validation
- https://github.com/siqueirarenan/shacl-jsonschema-converter
- https://github.com/mulesoft-labs/json-ld-schema
- https://github.com/RDFLib/pySHACL
- https://github.com/RDFLib/pySHACL/blob/master/test/test_schema_org.py
- https://github.com/RDFLib/pySHACL/tree/master/test/test_js
- [ ] https://github.com/pandas-dev/pandas/issues/3402
  - https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata
  - https://github.com/pandas-dev/pandas/issues/2485
  - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html

`DataFrame.attrs` is a dict that anything can or could modify upon read, transformation, or write; and it may not be persisted by file formats that do not support an auxiliary metadata file.

- https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties

```python
class DataFrameWithNonAttrsMetadata(pd.DataFrame):
    _metadata = ["additional_attrs", "prov"]
```
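That subclass pattern can be expanded into a runnable sketch. `ProvDataFrame` and the `prov` attribute contents are illustrative assumptions; the point is that names listed in `_metadata` are propagated through many pandas operations via `__finalize__`, unlike ad-hoc instance attributes:

```python
import pandas as pd


class ProvDataFrame(pd.DataFrame):
    # Attribute names in _metadata are copied by __finalize__ when
    # pandas constructs result objects (e.g. from .copy()).
    _metadata = ["prov"]

    @property
    def _constructor(self):
        # Keep operations returning ProvDataFrame rather than DataFrame.
        return ProvDataFrame


df = ProvDataFrame({"a": [1, 2]})
df.prov = {"prov:wasGeneratedBy": "an-example-activity"}  # illustrative key

# The prov attribute survives the copy, where .attrs handling is
# format- and operation-dependent.
df2 = df.copy()
```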
W3C PROV is a Linked Data specification for specifying data provenance information: who, what, when, how, etc.
What does that mean for pandas and dataclasses and pyarrow and optionally pydantic?
- [ ] How should additional per-field metadata be specified with type annotations (if type annotations are syntactically sufficient and preferable)?
- [ ] Linked Data is about URIs. How should URIs be specified when specifying data validation schemas, which are essential to data quality?
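Two stdlib-only conventions for attaching per-field metadata (including URIs) with type annotations can be sketched as follows. The `"unit"`/`"uri"` dict keys and the `Measurement` class are illustrative assumptions, not an established pandas-dataclasses API:

```python
from dataclasses import dataclass, field, fields
from typing import Annotated, get_type_hints


@dataclass
class Measurement:
    # Option 1: arbitrary metadata as an Annotated extra.
    distance: Annotated[float, {"unit": "m", "uri": "https://schema.org/distance"}]
    # Option 2: dataclasses.field(metadata=...), a read-only mapping.
    count: int = field(default=0, metadata={"uri": "https://schema.org/Integer"})


# Annotated extras are recoverable with include_extras=True:
hints = get_type_hints(Measurement, include_extras=True)
annotated_meta = hints["distance"].__metadata__[0]

# field(metadata=...) is exposed on dataclasses.fields():
field_meta = {f.name: dict(f.metadata) for f in fields(Measurement)}
```

Either mechanism could carry the schema/validation URIs asked about above; `Annotated` keeps the metadata in the type hint itself, while `field(metadata=...)` keeps the hint clean but separates the metadata from the type.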