chispa
Add allow_nan_equality option to assert_approx_df_equality
This pull request solves the issue by making a fairly big change to the API. Now, rather than having two assertion functions for each of DataFrame and column comparison, there is just one for each. Both of these assertion functions have an optional `precision` parameter, meaning that `assert_approx_df_equality` and `assert_approx_column_equality` could be removed.
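For illustration, here's a minimal sketch of how the unified API would be called, given two PySpark DataFrames `df1` and `df2` (the `precision` keyword is as proposed here and may change before merge):

```python
from chispa import assert_df_equality

# Exact comparison, as before
assert_df_equality(df1, df2)

# Approximate comparison via the optional precision parameter,
# replacing the separate assert_approx_df_equality function
assert_df_equality(df1, df2, precision=0.01)
```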
In addition to solving the ticket, I feel like I made a number of improvements to the code base:
- Added some useful spacing
- Improved the type hinting
- Documented the functions with docstrings using the numpy docstring convention (see the sketch after this list)
- Added a CHANGELOG.md to be updated with each release
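For reference, a sketch of the numpy docstring convention as applied here; the signature and wording are illustrative, not necessarily the exact ones in the PR:

```python
def assert_df_equality(df1, df2, precision=None):
    """Assert that two PySpark DataFrames are equal.

    Parameters
    ----------
    df1, df2 : pyspark.sql.DataFrame
        The DataFrames to compare.
    precision : float, optional
        If given, float values are compared approximately
        within this tolerance instead of exactly.

    Raises
    ------
    RowsNotEqualError
        If any pair of corresponding rows differs.
    """
```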
Actions for @MrPowers:
- [ ] Review and confirm happy with the API changes
- [ ] Review the renaming of the exception to `RowsNotEqualError` and consider any implications
- [ ] Update the screenshots in the README.md to reflect the new changes
When making a new release:
- [ ] Bump to a new minor release, and use any other means to notify users of API change
- [ ] Replace "Unreleased" in the CHANGELOG.md with the new version number and the date released.
- [ ] Add the CHANGELOG notes to the Release Notes on GitHub once the release is tagged.
Let me know what you think!
Closes #28
@mitches-got-glitches - thanks for following up and sorry for the delay.
Here are my main objectives with this PR:
- Get your new functionality included in the lib cause it's clearly an improvement
- No backwards incompatible changes (this lib follows semantic versioning, so we could technically make backwards incompatible changes cause we haven't made a major release yet, but I'd rather just avoid breaking changes, like Spark does)
- Keep these functions as fast as possible (cause Spark tests can get really slow). Don't want to unnecessarily do type checking and slow functions down.
- Support all the Python versions that we should reasonably be supporting. @nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.
Looking forward to driving this towards completion & appreciate your persistence! BTW - I'm a Python noob, so you're teaching me a lot!
> @nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.
Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.
Since the Python `typing` module is available in Python >=3.5, does this resolve your earlier comment, @MrPowers?
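(For context, this is the kind of hint under discussion, which needs the `typing` module from Python 3.5+. The signature is illustrative, not necessarily the exact one in the PR:)

```python
from typing import Optional
from pyspark.sql import DataFrame

def assert_df_equality(df1: DataFrame, df2: DataFrame, precision: Optional[float] = None) -> None:
    ...
```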
Looks like Spark 2.4.5 technically supports Python 3.4+, and I'd still like to support the Spark 2 users for a bit longer. That said, it seems reasonable to set the Python version as >=3.5 for this project. Python 3.5 was released in September 2015, and I think most users will at least be using Python 3.5, even the ones that are still using Spark 2.
Sidenote, when chispa eventually drops support for PySpark 2, we should do a major release (i.e. bump chispa to 1.x).
I will bump the Python version in the project now and add a section to explain this to users.
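A minimal sketch of what that pin might look like, assuming a setuptools-based setup.py (chispa's actual build configuration may differ):

```python
# Hypothetical setup.py excerpt; the version pin is the point here
from setuptools import setup, find_packages

setup(
    name="chispa",
    packages=find_packages(),
    python_requires=">=3.5",  # needed for the typing module
)
```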
To add another data point, my team is currently using Python 3.6.8 and Spark 2.4.
@mitches-got-glitches - great, glad we're still supporting PySpark 2.4 😉
Updated the required Python version.
So yes, the `typing` module question is resolved. We're requiring Python 3.5+ now, so using `typing` should be just fine 😃
@mitches-got-glitches - BTW, see here for the chispa download statistics. If this lib didn't have any users yet, then I wouldn't really care about breaking API changes. Since there are a lot of users, that's why I'm also focusing on backwards compatibility. A blessing and a curse I suppose 😉
Yeah, those are some good numbers! I think what you've set out to do is exactly right with the change to the readme and intention to launch 1.x. Users can always open a ticket for backwards compatibility if it's really necessary, or even create their own fork. But I agree, try not to get on their bad side ha.
So hopefully, once the thread above is resolved, all that will be required is an update to the screenshots in the readme, if you're happy with everything else, @MrPowers?
@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.
My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?
Hello, sorry for the slow reply. I think in terms of backwards compatibility there were two things:
- The type hinting that requires Python 3.5+ (this was resolved in the comments above)
- Backwards compatibility for the API (retaining `assert_approx_df_equality` and the equivalent `assert_approx_column_equality`)
I believe this can be done by doing something like the following, with a deprecation warning so users know to switch their code base over to `assert_df_equality` in future versions:
```python
import warnings

def assert_approx_df_equality(*args, **kwargs):
    # deprecated shim: warn, then delegate to the unified function
    warnings.warn("use assert_df_equality instead", DeprecationWarning)
    assert_df_equality(*args, **kwargs)
```
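If that shim approach is taken, here's a quick sketch of how it might be checked in the test suite; the `df1`/`df2` fixtures are hypothetical (assumed to hold equal DataFrames) and aren't part of this PR:

```python
import pytest

def test_assert_approx_df_equality_warns(df1, df2):
    # The deprecated shim should still pass through, but emit the warning
    with pytest.warns(DeprecationWarning):
        assert_approx_df_equality(df1, df2)
```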
Does that look like a good idea?
There are two more outstanding things I need your help on before this can come out of draft status:
- An answer to the replies to your comment above
- An update to the screenshots in the README.md to reflect the new changes
Are you able to support me with those @MrPowers ?