pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

Samukweku/polars clean names

Open samukweku opened this issue 1 year ago • 3 comments

PR Description

Please describe the changes proposed in the pull request:

  • polars submodule created.
  • clean_names function extend to work on Polars objects - works on a polars expression (cleans the column values) or a polars dataframe (cleans the column names).

This PR resolves #1343

PR Checklist

Please ensure that you have done the following:

  1. [x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. [x] If you're not on the contributors list, add yourself to AUTHORS.md.
  1. [x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

  • @ericmjl

samukweku avatar Apr 20 '24 09:04 samukweku

🚀 Deployed on https://deploy-preview-1351--pyjanitor.netlify.app

ericmjl avatar Apr 20 '24 10:04 ericmjl

@samukweku thanks for kickstarting this! I think I get where you're going with the PR, though I have some minor reservations about the current implementation. Would you be open to discussing here async before we commit to a decision on the PR?

My biggest reservation is that the implementation of clean_names in this PR deviates from the pattern that we use for pandas dataframes, namely, by registering a function as part of the dataframe global namespace. My secondary reservation is that the current implementation in this PR has code paths for polars and pandas within a single function. On thinking about the overall design and studying the API extension documentation and I think there may be a different way we could consider here.

Firstly, it may make our lives easier in the long run if we have the polars-enabled functions within their own submodules, following the pattern for spark dataframes. The main reason for this is the incompatibilities in the dataframe APIs -- a pet peeve of mine, but alas, a symptom of a vibrant ecosystem! While this arrangement means the polars submodule is likely to be less well-developed in the long run, it does reduce the energy barrier to having polars enthusiasts contribute to the library because of reduced mental overhead with adding a code path within existing functions. This arrangement should address the secondary reservation I wrote about above.

Secondly, by taking advantage of separate implementations and submodules, we can establish a suite of idioms to follow for creating polars-enabled functions. Early on, I would suggest that we simply follow the polars API extension documentation at first, but later on, we could create a polars-flavor library, akin to the pandas-flavor library, in which we use similar patterns to register a dataframe.jn namespace for polars-enabled functions. In this new world, each of the polars-enabled functions would be decorated as follows:

import polars_flavor as pf 

@pf.register_dataframe_method(namespace="jn") # jn is the default namespace, for janitor, but just showing here for completeness
def clean_names(...):
    ...

What are your thoughts, @samukweku?

ericmjl avatar Apr 20 '24 16:04 ericmjl

+1 on your suggestions @ericmjl ... Allows for easier/better maintenance. I will make the changes

samukweku avatar Apr 20 '24 23:04 samukweku