ydata-profiling Support for PySpark / Spark dataframes?

Open steven-struglia opened this issue 3 years ago • 29 comments

Would be super great to have PySpark / Spark dataframe functionality for this package as our team is using Spark as our scalable backend.

Thanks so much!

Aug 11 '20 21:08 steven-struglia

Planned for end of this year. Any contributions are welcome.

Aug 11 '20 21:08 sbrugman

is there a branch for this or an implementation plan?

Aug 13 '20 16:08 skorski

@skorski There is currently no branch for this. There used to be a version of PP that was executing a Spark backend. That implementation used the pyspark.sql module to generate the statistics. Feel welcome to contribute if you'd like.

ps. (PNNL in) Richland rang a bell, turns out we passed it when visiting Gravity Hill

Aug 13 '20 18:08 sbrugman

hi all!

Have taken a stab at this @ https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan

The plan is still a WIP, but if you guys think it's a viable overall approach, I would be happy to keep working on this! Also, please drop me an email at [email protected] if you would like write access to the doc. Happy to collaborate on this with anyone!

Aug 23 '20 15:08 chanedwin

The working document for the implementation plan can be found here: https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan. Contributions are welcome.

(Thanks to @chanedwin)

Aug 24 '20 18:08 sbrugman

@sbrugman The implementation plan is looking really good. I'll try to dig into it a bit and help where I can. Where is the best place to post questions?

Aug 25 '20 01:08 skorski

What about leveraging koalas? Wouldn't this be a huge shortcut?

Sep 26 '20 15:09 ahmedanis03

@skorski The Slack community for pandas-profiling can be used for that: https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ

@ahmedanis03 Thank you for the suggestion, we're also considering koalas. The bulk of the work seems to be in refactoring and specific features more or less regardless of API (correlations, missing diagrams).

Sep 26 '20 15:09 sbrugman

There is already a project that was built around porting pandas-profiling to Spark: https://github.com/julioasotodv/spark-df-profiling I wonder if its code base could be of any help.

Oct 08 '20 23:10 test32443

What about simply replacing pandas with koalas (https://github.com/databricks/koalas)? Maybe it is worth a shot?

Oct 10 '20 07:10 geoHeil

I used to develop a big data profiling library based on Spark and also looked for good open source solutions. Later, when I came across pandas-profiling, I gave up on the other solutions and have been quite happy with pandas-profiling. I have been using pandas-profiling to profile large production data too. The simple trick is to randomly sample data from the Spark cluster and bring it to one machine for profiling with pandas-profiling. Do we really need to profile the whole large dataset? For most non-extreme metrics, the answer is no. A sample of 100K rows will likely give you accurate enough information about the population. For extreme metrics such as max, min, etc., I calculated them myself.

If pandas-profiling is going to support profiling large data, this might be the easiest but good-enough way. Just my 2 cents.
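
For anyone who wants to try the sampling approach described above, here is a minimal sketch, not an official pandas-profiling feature. The helper names (`sample_fraction`, `full_data_extremes`, `profile_spark_sample`) are hypothetical, and it assumes `pyspark` and `pandas_profiling` are installed and that the sampled rows fit in driver memory. Min and max are computed on the full Spark DataFrame, since a sample can miss extremes.

```python
def sample_fraction(total_rows: int, target_rows: int = 100_000) -> float:
    """Fraction to pass to DataFrame.sample so that roughly target_rows remain."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_rows / total_rows)


def full_data_extremes(spark_df, columns):
    """Compute min and max on the full Spark DataFrame for the given columns."""
    # Imported lazily so sample_fraction can be used without pyspark installed.
    from pyspark.sql import functions as F

    exprs = []
    for c in columns:
        exprs.append(F.min(c).alias(f"{c}__min"))
        exprs.append(F.max(c).alias(f"{c}__max"))
    return spark_df.agg(*exprs).first().asDict() if exprs else {}


def profile_spark_sample(spark_df, target_rows: int = 100_000, seed: int = 42):
    """Randomly sample a Spark DataFrame, collect it to pandas, and profile it."""
    from pandas_profiling import ProfileReport

    fraction = sample_fraction(spark_df.count(), target_rows)
    # sample() is approximate: the result has about fraction * count rows.
    sampled = spark_df.sample(withReplacement=False, fraction=fraction, seed=seed)
    pdf = sampled.toPandas()  # brings the sample onto a single machine
    return ProfileReport(pdf, title="Profile of sampled data", minimal=True)
```

Something like `profile_spark_sample(df).to_file("report.html")` would then write the report, with `full_data_extremes(df, ["some_column"])` run separately for the exact min and max.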

Oct 11 '20 16:10 dclong

Is there a development branch for this if we want to contribute?

Oct 14 '20 06:10 fanzhuyifan

Hey all, is there any ongoing development for this Spark backend? We're thinking of adopting this tool, but as our datasets grow we're eventually going to have to use Spark for our profiling.

Nov 24 '20 20:11 ncoish

Hey @ncoish, yes it's ongoing! @chanedwin has done the lion's share of the work needed for the Spark backend, which now needs to be integrated.

Nov 24 '20 22:11 sbrugman

> Hey @ncoish, yes it's ongoing! @chanedwin has done the lion's share of the work needed for the Spark backend, which now needs to be integrated.

A few more weeks/months of waiting?

Dec 17 '20 07:12 os-datatools

I would also like to know where this code lives so I can help out. One thing I would like to put forward for consideration: being able to pull the underlying Spark SQL queries that form the profiles out into Spark, so that people who only want that part do not have to pull in the whole pandas-profiling package.

It would be nice to have that consistency, but otherwise I think it just makes more sense to build the analytical queries in Spark separately.
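
To illustrate what standalone profiling queries could look like, here is a rough sketch. It is not taken from pandas-profiling or from the Spark branch. The function name `build_profile_query`, the table name `events`, and the column `age` are all made up for the example. It only builds a Spark SQL string with a few basic per-column statistics.

```python
from typing import Sequence


def build_profile_query(table: str, columns: Sequence[str]) -> str:
    """Build one Spark SQL query with basic per-column statistics."""
    select_parts = ["COUNT(*) AS n_rows"]
    for col in columns:
        quoted = f"`{col}`"  # backticks quote identifiers in Spark SQL
        select_parts += [
            f"COUNT({quoted}) AS `{col}__non_null`",
            f"COUNT(DISTINCT {quoted}) AS `{col}__distinct`",
            f"MIN({quoted}) AS `{col}__min`",
            f"MAX({quoted}) AS `{col}__max`",
        ]
    return "SELECT " + ", ".join(select_parts) + f" FROM {table}"


# Hypothetical table and column names, for illustration only.
query = build_profile_query("events", ["age"])
```

With an active `SparkSession` named `spark` and a registered table or view, `spark.sql(query).first().asDict()` would return the statistics as a plain dictionary, without needing the rest of the profiling package.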

Mar 01 '21 16:03 kyprifog

https://github.com/pandas-profiling/pandas-profiling/pull/670

Mar 01 '21 16:03 sbrugman

Hi, is this implemented?

Oct 11 '21 05:10 jyotidhiman0610

Hi, how is this progressing? I see in the 'spark development plan' that the plan is to release early December 2021, is that still the expectation?

Nov 11 '21 14:11 araker

Progress can be tracked on the GitHub project. An alpha version is planned for December 2021. If anyone is interested in contributing, please reach out to @chanedwin (preferably via our Slack channel).

Nov 11 '21 14:11 sbrugman

hello! yup, echoing what Simon has said, hoping to get an alpha release out sometime in December (or early Jan at the latest, if things get messier than expected). Please feel free to join the Slack channel if you would like to get involved - always happy to discuss more there!

Nov 11 '21 17:11 chanedwin

Appreciate the effort in getting this to work with Spark. Any updates?