PR Description

Please describe the changes proposed in the pull request:

polars submodule created.
clean_names function extend to work on Polars objects - works on a polars expression (cleans the column values) or a polars dataframe (cleans the column names).

This PR resolves #1343

PR Checklist

Please ensure that you have done the following:

[x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

[x] If you're not on the contributors list, add yourself to AUTHORS.md.

[x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

Apr 20 '24 09:04 samukweku

🚀 Deployed on https://deploy-preview-1351--pyjanitor.netlify.app

Apr 20 '24 10:04 ericmjl

@samukweku thanks for kickstarting this! I think I get where you're going with the PR, though I have some minor reservations about the current implementation. Would you be open to discussing here async before we commit to a decision on the PR?

My biggest reservation is that the implementation of clean_names in this PR deviates from the pattern that we use for pandas dataframes, namely, by registering a function as part of the dataframe global namespace. My secondary reservation is that the current implementation in this PR has code paths for polars and pandas within a single function. On thinking about the overall design and studying the API extension documentation and I think there may be a different way we could consider here.

Firstly, it may make our lives easier in the long run if we have the polars-enabled functions within their own submodules, following the pattern for spark dataframes. The main reason for this is the incompatibilities in the dataframe APIs -- a pet peeve of mine, but alas, a symptom of a vibrant ecosystem! While this arrangement means the polars submodule is likely to be less well-developed in the long run, it does reduce the energy barrier to having polars enthusiasts contribute to the library because of reduced mental overhead with adding a code path within existing functions. This arrangement should address the secondary reservation I wrote about above.

Secondly, by taking advantage of separate implementations and submodules, we can establish a suite of idioms to follow for creating polars-enabled functions. Early on, I would suggest that we simply follow the polars API extension documentation at first, but later on, we could create a polars-flavor library, akin to the pandas-flavor library, in which we use similar patterns to register a dataframe.jn namespace for polars-enabled functions. In this new world, each of the polars-enabled functions would be decorated as follows:

import polars_flavor as pf 

@pf.register_dataframe_method(namespace="jn") # jn is the default namespace, for janitor, but just showing here for completeness
def clean_names(...):
    ...

What are your thoughts, @samukweku?

Apr 20 '24 16:04 ericmjl

+1 on your suggestions @ericmjl ... Allows for easier/better maintenance. I will make the changes

Apr 20 '24 23:04 samukweku

pyjanitor
pyjanitor copied to clipboard

Samukweku/polars clean names

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

pyjanitor pyjanitor copied to clipboard

Samukweku/polars clean names

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

pyjanitor
pyjanitor copied to clipboard