SDV
Support pyarrow-backed DataFrames and pyarrow data types
Problem Description
- Starting with pandas 3.0, pyarrow is planned to become a required runtime dependency (per PDEP-10).
- pyarrow-backed DataFrames use pyarrow data types rather than numpy ones, which makes many operations faster and more memory-efficient. This is because the pyarrow backend stores pandas DataFrames and Series in the Apache Arrow format instead of numpy arrays.
- Currently, sdtypes expect numpy-backed DataFrames (np.X data type).
- Users would benefit from SDV supporting these data types. Additionally, it would allow nullable data in data types that do not currently support it (for example, Boolean columns could have null values).
Expected behavior
- SDV is able to support pyarrow-backed DataFrames and maintains support for numpy-backed DataFrames.
- Eventually, SDV will deprecate support for numpy-backed DataFrames (as pandas is expected to as well).
Additional context
- https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
- Benefits of pyarrow-backed DataFrames can be found under motivation:
- https://github.com/pandas-dev/pandas/milestone/102
I love this