
Refactor the `data.__init__.py` module

Open jrycw opened this issue 1 year ago • 11 comments

Hello team,

This PR aims to address the Pandas dependency in dataset loading by introducing a unified `Dataset` API. With it, users can retrieve each dataset in the dataframe format they choose. For example:

  • To get a Pandas dataframe for the `sza` dataset, use `Dataset.sza.to_pandas()`.
  • To get a Polars dataframe for the same dataset, use `Dataset.sza.to_polars()`.
  • To get a PyArrow table for the same dataset, use `Dataset.sza.to_pyarrow()` (not supported yet: `_convert_to_pyarrow()` and `to_pyarrow()` still need to be implemented).

This way, users can use autocomplete to select both the dataset and the desired dataframe type.
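To illustrate the shape of the API described above, here is a minimal sketch rather than the PR's actual code. The names `_DatasetEntry`, `_SZA_CSV`, and `to_columns()` are hypothetical, and the inline CSV is a made-up stand-in for the bundled data file. Pandas and Polars are imported lazily inside each converter, so neither is needed just to import the module.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass
from typing import Any

# Hypothetical stand-in for a bundled CSV file (not the real sza data).
_SZA_CSV = "latitude,month,tst,sza\n20,jan,0400,\n20,jan,0730,84.9\n"


@dataclass(frozen=True)
class _DatasetEntry:
    """One dataset plus its converters (hypothetical helper class)."""

    name: str
    csv_text: str

    def to_columns(self) -> dict[str, list[str]]:
        """Stdlib-only view of the data, used here to check the sketch."""
        reader = csv.DictReader(io.StringIO(self.csv_text))
        rows = list(reader)
        return {col: [row[col] for row in rows] for col in reader.fieldnames or []}

    def to_pandas(self) -> Any:
        import pandas as pd  # lazy import: Pandas is only needed when requested

        return pd.read_csv(io.StringIO(self.csv_text))

    def to_polars(self) -> Any:
        import polars as pl  # lazy import: Polars is only needed when requested

        return pl.read_csv(io.BytesIO(self.csv_text.encode("utf-8")))


class Dataset:
    """Namespace so that `Dataset.<name>.to_<library>()` autocompletes."""

    sza = _DatasetEntry("sza", _SZA_CSV)
```

With this layout, `Dataset.sza.to_polars()` works in an environment that has Polars but not Pandas, because each converter imports only the library it needs.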

To facilitate the transition:

  • Each dataset class name begins with an uppercase letter (e.g., `Sza`).
  • A lowercase variable (e.g., `sza`) is provided as a Pandas dataframe, created using `to_pandas()`.
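The transition scheme in the two points above could look roughly like the following sketch. It assumes a class-based layout with a classmethod converter, and `_SZA_CSV` is a made-up stand-in for the bundled data. The `ImportError` fallback exists only so the sketch runs without Pandas installed; the PR does not describe it.

```python
import io

# Hypothetical stand-in for a bundled CSV file (not the real sza data).
_SZA_CSV = "latitude,month,tst,sza\n20,jan,0400,\n20,jan,0730,84.9\n"


class Sza:
    """Dataset class; the uppercase name leaves `sza` free for the old variable."""

    _csv_text = _SZA_CSV

    @classmethod
    def to_pandas(cls):
        import pandas as pd  # remove pandas

        return pd.read_csv(io.StringIO(cls._csv_text))


# Backward-compatible lowercase variable, built eagerly via to_pandas().
try:
    sza = Sza.to_pandas()  # remove pandas
except ImportError:
    sza = None  # assumption for this sketch only: tolerate a missing Pandas
```

Existing code that imports the lowercase `sza` keeps receiving a Pandas dataframe, while new code can call `Sza.to_pandas()` or another converter explicitly.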

If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:

  • Remove the warning message.
  • Remove the lowercase variables representing Pandas dataframes, and rename the dataset classes to lowercase.
  • Update `__all__`.
  • Update documentation and tests to reflect the new API.
  • Address code sections marked with `# remove pandas` for further cleanup.

I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.

jrycw • Nov 28 '24 14:11