ydata-profiling
Support for PySpark / Spark dataframes?
Would be super great to have PySpark / Spark dataframe functionality for this package as our team is using Spark as our scalable backend.
Thanks so much!
Planned for end of this year. Any contributions are welcome.
is there a branch for this or an implementation plan?
@skorski There is currently no branch for this. There used to be a version of pandas-profiling that ran on a Spark backend. That implementation used the pyspark.sql module to generate the statistics. Feel welcome to contribute if you'd like.
ps. (PNNL in) Richland rang a bell, turns out we passed it when visiting Gravity Hill
hi all!
Have taken a stab at this @ https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan
The plan is still a WIP, but if you think it's a viable overall approach, I would be happy to keep working on this! Also, please drop me an email at [email protected] if you would like write access to the doc. Happy to collaborate on this with anyone!
The working document for the implementation plan can be found here: https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan. Contributions are welcome.
(Thanks to @chanedwin)
@sbrugman The implementation plan is looking really good. I'll try to dig into it a bit and help where I can. Where is the best place to post questions?
What about leveraging koalas? Wouldn't this be a huge shortcut?
@skorski The Slack community for pandas-profiling can be used for that: https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ
@ahmedanis03 Thank you for the suggestion, we're also considering koalas. The bulk of the work seems to be in refactoring and specific features more or less regardless of API (correlations, missing diagrams).
There is already a project that was built around porting pandas-profiling to Spark: https://github.com/julioasotodv/spark-df-profiling I wonder if its code base could be of any help.
What about simply replacing pandas with koalas (https://github.com/databricks/koalas)? Maybe it is worth a shot?
I used to develop a big data profiling library based on Spark and also looked for good open source solutions. Later, when I came across pandas-profiling, I gave up on the other solutions and have been quite happy with pandas-profiling. I have been using it to profile large production data too. The simple trick is to randomly sample data from the Spark cluster and bring the sample to one machine for profiling with pandas-profiling. Do we really need to profile the whole large dataset? For most non-extreme metrics, the answer is no. A sample of 100K rows will likely give you accurate enough information about the population. For extreme metrics such as max, min, etc., I calculated them myself.
If pandas-profiling is going to support profiling large data, this might be the easiest but good-enough way. Just my 2 cents.
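For anyone who wants to try the sampling approach described above, here is a minimal sketch, not an official recipe. The helper names (`sample_fraction`, `profile_spark_sample`), the 100K default, and the choice to compute min/max on the full Spark dataframe for numeric columns only are my assumptions. The import uses `ydata_profiling`; older releases import from `pandas_profiling` instead.

```python
def sample_fraction(total_rows: int, target_rows: int = 100_000) -> float:
    """Fraction to sample so that roughly target_rows rows come back."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_rows / total_rows)


def profile_spark_sample(spark_df, target_rows: int = 100_000, seed: int = 42):
    """Profile a random sample locally; compute min/max on the full data.

    Hypothetical helper: returns (ProfileReport, dict of extremes).
    """
    # Imported lazily so sample_fraction works without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import NumericType
    from ydata_profiling import ProfileReport  # older releases: pandas_profiling

    # Pull a random sample of roughly target_rows rows to the driver.
    fraction = sample_fraction(spark_df.count(), target_rows)
    sample_pdf = spark_df.sample(fraction=fraction, seed=seed).toPandas()
    report = ProfileReport(sample_pdf, title="Profile of a random sample")

    # Extremes are sensitive to sampling, so compute them on the cluster.
    numeric_cols = [
        field.name
        for field in spark_df.schema.fields
        if isinstance(field.dataType, NumericType)
    ]
    extremes = {}
    if numeric_cols:
        exprs = []
        for name in numeric_cols:
            exprs.append(F.min(name).alias(f"{name}__min"))
            exprs.append(F.max(name).alias(f"{name}__max"))
        extremes = spark_df.agg(*exprs).first().asDict()
    return report, extremes
```

Note that `DataFrame.sample` returns approximately, not exactly, the requested fraction of rows, so the sample size will vary a little from run to run.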
Is there a development branch for this if we want to contribute?
Hey all, is there any ongoing development for this Spark backend? We're thinking of adopting this tool, but as our datasets grow we're eventually going to have to use Spark for our profiling.
Hey @ncoish, yes it's ongoing ! @chanedwin has done the lion's share of the work needed for the Spark backend, which now needs to be integrated
Few more weeks/months of wait?
I would also like to know where this code lives so I can help out. One thing I would like to put forward for consideration: make it possible to pull out the underlying Spark SQL queries that form the profiles and run them directly in Spark. That way, people who only want that part would not need to pull in the whole pandas-profiling package.
It would be nice to have that consistency, but otherwise I think it just makes more sense to build the analytical queries in Spark separately.
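To make the idea of standalone analytical queries concrete, here is a rough sketch of what building one might look like. This is purely illustrative and not the queries pandas-profiling generates; the function name `build_profile_query`, the chosen statistics, and the `__` alias convention are all assumptions on my part.

```python
from typing import Sequence


def build_profile_query(table: str, columns: Sequence[str]) -> str:
    """Build one Spark SQL query with basic per-column summary statistics.

    Hypothetical helper: column names are wrapped in backticks but not
    otherwise escaped, so names containing backticks are not supported.
    """
    parts = ["COUNT(*) AS n_rows"]
    for col in columns:
        quoted = f"`{col}`"
        parts.append(f"COUNT({quoted}) AS `{col}__non_null`")
        parts.append(f"COUNT(DISTINCT {quoted}) AS `{col}__distinct`")
        parts.append(f"MIN({quoted}) AS `{col}__min`")
        parts.append(f"MAX({quoted}) AS `{col}__max`")
    return f"SELECT {', '.join(parts)} FROM {table}"
```

With an active session, the result could be run as `spark.sql(build_profile_query("my_table", ["age"]))`, where `my_table` is a hypothetical registered table or temp view.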
https://github.com/pandas-profiling/pandas-profiling/pull/670
Hi, is this implemented?
Hi, how is this progressing? I see in the 'spark development plan' that the plan is to release early December 2021, is that still the expectation?
The progress can be tracked on the github project. An alpha version is planned December 2021. In case anyone is interested in contributing, please reach out to @chanedwin (preferably via our Slack channel)
hello! yup, echoing what Simon has said, hoping to get an alpha release down sometime in December (or latest early Jan, if things get messier than expected). Please feel free to join the slack channel if you would like to get involved - always happy to discuss more there!
Appreciate the effort in getting this to work with Spark. Any updates?
Thanks for all of the work on this! Is this still in progress?
I fear this endeavor has gone stale.
I would like to be a beta tester of this feature :-)
Is someone still working on this feature?
Any idea when this would be released?
No news so far?
Looking forward to having this feature released. Also, if needed, I can be a beta tester and contribute.