great-tables
Refactor the `data.__init__.py` module
Hello team,
This PR aims to address the Pandas dependency in reading datasets by introducing a unified Dataset API. The proposed approach allows users to retrieve datasets in a user-specified dataframe format. For example:
- To get a `Pandas` dataframe for the `sza` dataset, use `Dataset.sza.to_pandas()`.
- To get a `Polars` dataframe for the same dataset, use `Dataset.sza.to_polars()`.
- To get a `PyArrow` dataframe for the same dataset, use `Dataset.sza.to_pyarrow()` (implementation of `_convert_to_pyarrow()` and `to_pyarrow()` is needed to support this).
This way, users can use autocomplete to select both the dataset and the desired dataframe type.
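To make the shape of the API concrete, here is a minimal sketch, not the code in this PR. The `_DatasetEntry` helper, its `path` field, and the `data/sza.csv` location are hypothetical names chosen for illustration; only `Dataset.sza`, `to_pandas()`, and `to_polars()` come from the description above. Each dataframe library is imported inside the method that needs it, so neither has to be installed just to import the module.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class _DatasetEntry:
    """One bundled dataset that can be loaded into several dataframe types.

    Hypothetical helper: the name and fields are assumptions for this sketch.
    """

    name: str
    path: Path  # assumed location of the bundled CSV file

    def to_pandas(self):
        # Imported lazily, so Polars-only users never need Pandas installed.
        import pandas as pd

        return pd.read_csv(self.path)

    def to_polars(self):
        # Imported lazily for the same reason.
        import polars as pl

        return pl.read_csv(self.path)


class Dataset:
    """Namespace whose attributes are datasets, so autocomplete can list them."""

    sza = _DatasetEntry("sza", Path("data") / "sza.csv")
```

With a layout like this, typing `Dataset.` offers the dataset names and typing `Dataset.sza.` offers the available `to_*()` converters, which is the two-step autocomplete described above.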
To facilitate the transition:
- Each dataset name begins with an uppercase letter (e.g., `Sza`).
- A lowercase variable (e.g., `sza`) is provided as a `Pandas` dataframe, created using `to_pandas()`.
If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:
- Remove the warning message.
- Remove the lowercase variables representing `Pandas` dataframes, and rename the dataset classes to lowercase.
- Update `__all__`.
- Update documentation and tests to reflect the new API.
- Address code sections marked with `# remove pandas` for further cleanup.
I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.