pingouin icon indicating copy to clipboard operation
pingouin copied to clipboard

Support for array-like and pandas-like data

Open smathot opened this issue 6 months ago • 9 comments

This is a small patch that does not change any functionality or fixes issues, but rather makes Pingouin more flexible in the types of data that it accepts. Specifically, right now Pingouin does not (always) accept data that is compatible with numpy.ndarray or pandas.DataFrame when it is not exactly of that type. This patch adds some flexibility by checking whether data can be converted to an ndarray or DataFrame when it is not of those types. It seems that data-type checking happens at different places throughout the code, so I'm not sure I caught everything.

For me, the main reason to submit this PR is to make DataMatrix 2.0, which is close to fully compatible with DataFrame, also fully compatible with Pingouin. With this patch it is. The DataMatrix unit tests for now still fail on GitHub, but that's because it's pulling the latest stable version of Pingouin.

All Pingouin unittests pass with this patch.

smathot avatar Jun 19 '25 15:06 smathot

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.54%. Comparing base (c939d2d) to head (41df958).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #470   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files          19       19           
  Lines        3360     3360           
  Branches      492      492           
=======================================
  Hits         3311     3311           
  Misses         26       26           
  Partials       23       23           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 19 '25 15:06 codecov[bot]

I'm hesitant to support this for fear of getting into a pattern of accommodating compatibility with niche packages. @smathot, is it too much for DataMatrix users to simply use the to-pandas features of DataMatrix right before calling Pingouin?

My primary concerns are:

  1. Checking for a __dataframe__ attribute doesn't have any immediate red flags to me now, but it seems more principled to check against the more reliable and consistent DataFrame type. This might introduce confusion later on, like if someone is using custom code with the same attribute.
  2. Accommodating one niche package now opens up the floodgates for the many niche communities that use Pingouin to do something similar, which could get messy.

It seems to me that DataMatrix could deal with the one line conversion prior to calling Pingouin, or alternatively the onus could be on the DataMatrix code base to add a method to DataMatrix objects that call Pingouin and do the conversion on the DataMatrix end. What do you think @smathot ? Am I missing a complexity here that makes this change to Pingouin more critical for your users?

remrama avatar Jun 19 '25 17:06 remrama

Yes, I assumed that you wouldn't want to make ad-hoc exceptions. For this reason, you'll see that I avoided the patch from specifically targeting datamatrix, and instead used the DataFrame interchange protocol. This is an attempt at improving interoperability between packages that somehow represent data (not just DataMatrix), and essentially comes down to libraries implementing a __dataframe__() function. The same is true for the __array__() function when it comes NumPy, although that seems more like a soft standard that doesn't really have a name.

I think you can safely adopt these patches, because they don't have any disadvantage for existing users, are not specific to one package, and they have a small benefit of making it easier for other data packages to work with Pingouin :relaxed:

smathot avatar Jun 19 '25 17:06 smathot

Oh cool, thanks for the explanation and link @smathot, I wasn't aware of this interchange protocol. I see how this could be useful.

It's hard for me to tell from that link how actively adopted this is. But some quick searching makes me skeptical that this has a lot of support these days. Again, please correct me! Sounds like a cool idea and I'd be happy to see any links to active projects that are developing this protocol, and especially some major projects that are actively utilizing it. Maybe the repo was transfered somewhere else or something and I'm looking at the wrong thing. But here's what I found:

  • The latest open Issue on the dataframe-api github repo is a request to archive the project (data-apis/dataframe-api#363)
  • From Pandas, the api.interchange.from_dataframe docs page has a Note about deprecation ("From pandas 2.3 onwards, from_dataframe uses the PyCapsule Interface, only falling back to the interchange protocol if that fails") and a Warning about "sever implementation issues". Maybe you want to use the "PyCapsule interface" mentioned here? (I didn't look into that)

If I'm wrong and this is an actively-supported thing and the community thinks this is a good idea for Pingouin, I would still recommend:

  • adding unit-tests that include non-pandas dataframes we expect to be compatible (e.g., polars)
  • adding an explanation or link to this interchange protocol thing in a comment line above the assertions
  • minor update to assertion statement that says something like "must be dataframe-like" or "must conform to dataframe API standard", maybe including a link (these last 2 points are just to help someone like me who had no idea about this)

remrama avatar Jun 19 '25 23:06 remrama

Reading this, I think your assessment is regarding the DataFrame protocol is probably correct, and that not many libraries except for DataFrame and DataMatrix are going to implement a __dataframe__ property. Checking for a __dataframe__ property still seems safe to me, with little practical disadvantage, and as I mentioned with a small advantage.

When it comes to checking for __array__, I think there are more obvious data types that benefit, such as tensors from machine learning libraries, which are widely used. (Though of course not typically in the context that pingouin is used.)

It's up to you! I would of course prefer this patch to be merged, because DataMatrix benefits from it. If you agree, then I can also add the polish you suggest. But if not, then no hard feelings!

smathot avatar Jun 20 '25 09:06 smathot

Okay, that makes sense. Thanks for the update.

I would still vote no on __dataframe__, mostly just to (a) keep Pingouin's underlying codebase as simple/straight-forward as possible and (b) prevent unforeseen issues with it in the future. I agree with you that it seems safe at first glance now, but I don't know how it would play out. There isn't a lot of active development on Pingouin right now, making it even more important to keep the risks low. (I don't really see a benefit of checking for __array__ either. I am not aware of any big need for that, and again I prefer to keep the codebase as self-explanatory as possible.)

But this decision is most definitely not up to only me! I have a minor voice here and we will have to wait for others to chime in before moving forward.

remrama avatar Jun 22 '25 02:06 remrama

Hi there,

Thanks for the PR and detailed explanation @smathot. I tend to agree with @remrama that I don't see a clear benefit of adding this to Pingouin. Would the conversion that @remrama mentioned be feasible on your side?

It seems to me that DataMatrix could deal with the one line conversion prior to calling Pingouin, or alternatively the onus could be on the DataMatrix code base to add a method to DataMatrix objects that call Pingouin and do the conversion on the DataMatrix end.

Thank you both, Raphael

raphaelvallat avatar Jun 22 '25 12:06 raphaelvallat

Would the conversion that @remrama mentioned be feasible on your side?

Sure, it's possible. But it means that I have to instruct users with something like: "You can pass a DataMatrix to any function that accepts a DataFrame, except for the following pingouin functions: …". And the same is true for array-like objects like tensors. I don't fully understand the hesitation, to be honest! After all, this is a safe PR with no disadvantages, no meaningful decrease in readability, and only minor advantages. But it's 100% your call! Feel free to close the PR if it doesn't work for you.

smathot avatar Jun 23 '25 15:06 smathot

If you're looking for a standardised interface, Narwhals that's widely adopted, actively maintained, and is used by several major libraries. I started it out of frustration with the dataframe api going nowhere and staying too tied to pandas syntax

MarcoGorelli avatar Aug 15 '25 09:08 MarcoGorelli