pyjanitor
pyjanitor copied to clipboard
Samukweku/polars clean names
PR Description
Please describe the changes proposed in the pull request:
- polars submodule created.
clean_namesfunction extend to work on Polars objects - works on a polars expression (cleans the column values) or a polars dataframe (cleans the column names).
This PR resolves #1343
PR Checklist
Please ensure that you have done the following:
- [x] PR in from a fork off your branch. Do not PR from
<your_username>:dev, but rather from<your_username>:<feature-branch_name>.
- [x] If you're not on the contributors list, add yourself to
AUTHORS.md.
- [x] Add a line to
CHANGELOG.mdunder the latest version header (i.e. the one that is "on deck") describing the contribution.- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.
Automatic checks
There will be automatic checks run on the PR. These include:
- Building a preview of the docs on Netlify
- Automatically linting the code
- Making sure the code is documented
- Making sure that all tests are passed
- Making sure that code coverage doesn't go down.
Relevant Reviewers
Please tag maintainers to review.
- @ericmjl
🚀 Deployed on https://deploy-preview-1351--pyjanitor.netlify.app
@samukweku thanks for kickstarting this! I think I get where you're going with the PR, though I have some minor reservations about the current implementation. Would you be open to discussing here async before we commit to a decision on the PR?
My biggest reservation is that the implementation of clean_names in this PR deviates from the pattern that we use for pandas dataframes, namely, by registering a function as part of the dataframe global namespace. My secondary reservation is that the current implementation in this PR has code paths for polars and pandas within a single function. On thinking about the overall design and studying the API extension documentation and I think there may be a different way we could consider here.
Firstly, it may make our lives easier in the long run if we have the polars-enabled functions within their own submodules, following the pattern for spark dataframes. The main reason for this is the incompatibilities in the dataframe APIs -- a pet peeve of mine, but alas, a symptom of a vibrant ecosystem! While this arrangement means the polars submodule is likely to be less well-developed in the long run, it does reduce the energy barrier to having polars enthusiasts contribute to the library because of reduced mental overhead with adding a code path within existing functions. This arrangement should address the secondary reservation I wrote about above.
Secondly, by taking advantage of separate implementations and submodules, we can establish a suite of idioms to follow for creating polars-enabled functions. Early on, I would suggest that we simply follow the polars API extension documentation at first, but later on, we could create a polars-flavor library, akin to the pandas-flavor library, in which we use similar patterns to register a dataframe.jn namespace for polars-enabled functions. In this new world, each of the polars-enabled functions would be decorated as follows:
import polars_flavor as pf
@pf.register_dataframe_method(namespace="jn") # jn is the default namespace, for janitor, but just showing here for completeness
def clean_names(...):
...
What are your thoughts, @samukweku?
+1 on your suggestions @ericmjl ... Allows for easier/better maintenance. I will make the changes