chispa Add allow_nan_equality option to assert_approx_df

This pull request solves the issue by making a fairly big change to the API. Now, rather than having two assertion functions for both DataFrame and column comparison, there is just one for each. Both these assertion functions have an optional parameter for precision, meaning the assert_approx_df_equality and assert_approx_column_equality could be removed.

In addition to solving the ticket, I feel like I made a number of improvements to the code base:

Added some useful spacing
Improved the type hinting
Documented the functions with docstrings using the numpy docstring convention
Added CHANGELOG.md to be updated with each release

Actions for @MrPowers :

[ ] Review and confirm happy with API changes
[ ] Review renaming of exception to RowsNotEqualError and consider any implications
[ ] Update the screenshots in the README.md to reflect the new changes

When making a new release:

[ ] Bump to a new minor release, and use any other means to notify users of API change
[ ] Replace "Unreleased" in the CHANGELOG.md with the new version number and the date released.
[ ] Add the Changelog notes to the Release Notes on GitHub once all tagged.

Let me know what you think!

Closes #28

Apr 26 '21 16:04 mitches-got-glitches

@mitches-got-glitches - thanks for following up and sorry for the delay.

Here are my main objectives with this PR:

Get your new functionality included in the lib cause it's clearly an improvement
No backwards incompatible changes (this lib follows semantic versioning, so we could technically make backwards incompatible changes cause we haven't made a major release yet, but I'd rather just avoid breaking changes, like Spark does)
Keep these functions as fast as possible (cause Spark tests can get really slow). Don't want to unnecessarily do type checking and slow functions down.
Support all the Python versions that we should reasonably be supporting. @nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Looking forward to driving this towards completion & appreciate your persistence! BTW - I'm a Python noob, so you're teaching me a lot!

May 11 '21 18:05 MrPowers

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

May 11 '21 18:05 nchammas

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

Since the Python typing module is available in >=3.5 does this resolve your earlier comment @MrPowers ?

May 12 '21 12:05 mitches-got-glitches

Looks like Spark 2.4.5 technically supports Python 3.4+ and I'd still like to support the Spark 2 users for a bit longer. That said, seems reasonable to set the Python version as >=3.5 for this project. Python 3.5 was released in September 2015 and think most users will at least be using Python 3.5, even the ones that are still using Spark 2.

Sidenote, when chispa eventually drops support for PySpark 2, we should do a major release (i.e. bump chispa to 1.x).

I will bump the Python version in the project now and add a section to explain this to users.

May 12 '21 13:05 MrPowers

To add another data point, my team is currently using Python 3.6.8 and Spark 2.4.

May 12 '21 13:05 mitches-got-glitches

@mitches-got-glitches - great, glad we're still supporting PySpark 2.4 😉

Updated the required Python version.

So yes, the typing module question is resolved. We're requiring Python 3.5+ now, so using typing should be just fine 😃

May 12 '21 13:05 MrPowers

@mitches-got-glitches - BTW, see here for the chispa download statistics. If this lib didn't have any users yet, then I wouldn't really care about breaking API changes. Since there are a lot of users, that's why I'm also focusing on backwards compatibility. A blessing and a curse I suppose 😉

May 12 '21 14:05 MrPowers

Yeah, those are some good numbers! I think what you've set out to do is exactly right with the change to the readme and intention to launch 1.x. Users can always open a ticket for backwards compatibility if it's really necessary, or even create their own fork. But I agree, try not to get on their bad side ha.

So hopefully, once the thread above is resolved, then I think all that will be required is an update to the screenshots in the readme if you're happy with everything else @MrPowers ?

May 12 '21 16:05 mitches-got-glitches

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

May 21 '21 12:05 MrPowers

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

Hello, sorry for the slow reply. I think in terms of backwards compatibility there were two things:

The type hinting that requires Python 3.5+ (this was resolved in the comments above)
Backwards capability for the API (retaining assert_approx_df_equality and equivalent assert_approx_column_equality)

I believe this can be done by doing something like the following with a deprecation warning so users know to switch their code base over to assert_df_equality in future versions:

def assert_approx_df_equality(*args, **kwargs):
    assert_df_equality(*args, **kwargs)

Does that look like a good idea?

There are two more outstanding things I need your help on before this can come out of draft status:

Firstly, an answer to the replies to your comment above.
Updating the screenshots in the README.md to reflect the new changes

Are you able to support me with those @MrPowers ?

Jun 08 '21 17:06 mitches-got-glitches

chispa
chispa copied to clipboard

Add allow_nan_equality option to assert_approx_df_equality

chispa chispa copied to clipboard

Add allow_nan_equality option to assert_approx_df_equality

chispa
chispa copied to clipboard