pandas-dataclasses

ENH: pyarrow and optionally pydantic

Open • westurner opened this issue 2 years ago • 1 comment

What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?

  • pandas 2.0 can now use pyarrow as a dtype backend in many places, and pydantic does data validation with a drop-in replacement for dataclasses.dataclass at pydantic.dataclasses.dataclass.

    • https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes : pd.read_*(..., dtype_backend="pyarrow")
    • https://pandas.pydata.org/docs/dev/user_guide/pyarrow.html
    • https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.convert_dtypes.html#pandas.DataFrame.convert_dtypes
  • https://www.google.com/search?q=pyarrow+dataclasses

    • https://github.com/freegor/be-pydantic/blob/main/main.py
    • https://arrow.apache.org/docs/python/data.html
    • https://arrow.apache.org/docs/python/pandas.html
    • https://github.com/apache/arrow/blob/main/python/pyarrow/cffi.py
      • https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/python_to_arrow.cc
      • https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc
      • https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
    • https://github.com/apache/arrow/blob/main/python/pyarrow/dataset.py
    • https://github.com/apache/arrow/blob/97821aa5af650e6478116cae7c0128fe37dad067/python/pyarrow/tests/test_pandas.py#L146 TestConvertMetadata
    • https://github.com/apache/arrow/blob/main/python/pyarrow/tests/test_schema.py#L221 test_schema*()
  • https://www.google.com/search?q=pydantic+dataclasses

    • https://docs.pydantic.dev/usage/dataclasses/
      https://github.com/pydantic/pydantic/blob/main/docs/usage/dataclasses.md
      • If you don't want to use pydantic's BaseModel you can instead get the same data validation on standard dataclasses

      • Difference with stdlib dataclasses: Note that the dataclasses.dataclass from Python stdlib implements only the __post_init__ method since it doesn't run a validation step.

        When substituting usage of dataclasses.dataclass with pydantic.dataclasses.dataclass, it is recommended to move the code executed in the __post_init__ method to the __post_init_post_parse__ method, and only leave behind part of code which needs to be executed before validation. https://docs.pydantic.dev/usage/dataclasses/#difference-with-stdlib-dataclasses

    • https://github.com/pydantic/pydantic/blob/main/pydantic/dataclasses.py
    • https://github.com/pydantic/pydantic/blob/main/pydantic/_internal/_dataclasses.py
    • https://github.com/pydantic/pydantic/tree/main/docs/examples/ dataclasses*.py
    • https://github.com/pydantic/pydantic/blob/main/tests/test_dataclasses.py @pydantic.dataclasses.dataclass

westurner, Mar 20 '23 16:03

FWIW, re: data validation these days: pydantic_schemaorg validates with schema.org schema, and there's QuantitativeValue[Distribution]. CSVW (CSV on the Web) is a standard for CSV in RDF. RDF has many representations: RDF/XML, Turtle (.ttl), JSON-LD (.json, application/ld+json), and RDFa (RDF-in-(HTML)-Attributes). Some applications, including search engines, work with at least bibliographic linked data, for example subtypes of https://schema.org/CreativeWork such as https://schema.org/ScholarlyArticle, :Dataset, and :DataCatalog. Other existing standards for data schema and/or validation: SDMX (pandaSDMX), W3C Data Cubes (pandas-datacube), JSONschema (pydantic, react-jsonschema-form), and W3C SHACL (Schema.org).

  • https://github.com/lexiq-legal/pydantic_schemaorg generates templated pydantic .py source files containing validators for all of the rdfs:Class and rdfs:Property defined in a release of the https://schema.org/ meta-vocabulary

    • https://github.com/schemaorg/schemaorg/tree/v15.0-release/data/releases/15.0
    • https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-all-https.ttl
    • https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-shapes.shacl
    • https://github.com/schemaorg/schemaorg/blob/v15.0-release/data/releases/15.0/schemaorg-subclasses.shacl

    For example:

    • https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/Quantity.py https://schema.org/Quantity
    • https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/QuantitativeValue.py https://schema.org/QuantitativeValue
    • https://github.com/lexiq-legal/pydantic_schemaorg/blob/main/pydantic_schemaorg/QuantitativeValueDistribution.py https://schema.org/QuantitativeValueDistribution
  • W3C SHACL

    • pydantic does not (yet?) do W3C SHACL validation
    • https://github.com/siqueirarenan/shacl-jsonschema-converter
    • https://github.com/mulesoft-labs/json-ld-schema
    • https://github.com/RDFLib/pySHACL
    • https://github.com/RDFLib/pySHACL/blob/master/test/test_schema_org.py
    • https://github.com/RDFLib/pySHACL/tree/master/test/test_js
  • [ ] https://github.com/pandas-dev/pandas/issues/3402

    • https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata
    • https://github.com/pandas-dev/pandas/issues/2485
    • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html DataFrame.attrs is a dict that anything can or could modify upon read, transformation, or write; and may not be persisted by file formats that do not support an auxiliary metadata file
    • https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties
      import pandas as pd
      class DataFrameWithNonAttrsMetadata(pd.DataFrame):
          # _metadata names normal properties to be passed on to manipulation results
          _metadata = ["additional_attrs", "prov"]
      
  • W3C PROV is a Linked Data specification for specifying data provenance information: who, what, when, how, etc.

What does that mean for pandas and dataclasses and pyarrow and optionally pydantic?

  • [ ] How should additional per-field metadata be specified with type annotations (if type annotations are syntactically sufficient and preferable)?
  • [ ] Linked Data is about URIs. How should URIs be specified when specifying data validation schema which are essential to data quality?
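To make the two questions above concrete, here is a minimal stdlib-only sketch of two candidate spellings for per-field URIs: one inside the type annotation via typing.Annotated, one via dataclasses.field(metadata=...). The `URI` marker class, the "uri" metadata key, `Observation`, and `field_uris` are all hypothetical names for illustration; neither spelling is an established convention.

```python
from dataclasses import dataclass, field, fields
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class URI:  # hypothetical marker type for Linked Data identifiers
    value: str


@dataclass
class Observation:  # hypothetical record type
    # Option 1: metadata carried in the type annotation itself.
    value: Annotated[float, URI("https://schema.org/QuantitativeValue")]
    # Option 2: metadata carried in dataclasses.field(metadata=...).
    name: str = field(default="", metadata={"uri": "https://schema.org/name"})


def field_uris(cls) -> dict[str, str]:
    """Collect per-field URIs from both spellings."""
    hints = get_type_hints(cls, include_extras=True)
    out: dict[str, str] = {}
    for f in fields(cls):
        # Annotated[...] exposes its extra arguments as __metadata__.
        for extra in getattr(hints[f.name], "__metadata__", ()):
            if isinstance(extra, URI):
                out[f.name] = extra.value
        if "uri" in f.metadata:
            out[f.name] = f.metadata["uri"]
    return out


print(field_uris(Observation))
```

The Annotated form keeps the URI visible to any tool that reads type hints, while field(metadata=...) stays specific to dataclasses; which of the two (if either) fits pandas and pyarrow better is left open.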

westurner, Mar 20 '23 17:03