ydata-profiling

Support for PySpark / Spark dataframes?

Open steven-struglia opened this issue 3 years ago • 29 comments

Would be super great to have PySpark / Spark dataframe functionality for this package as our team is using Spark as our scalable backend.

Thanks so much!

steven-struglia avatar Aug 11 '20 21:08 steven-struglia

Planned for end of this year. Any contributions are welcome.

sbrugman avatar Aug 11 '20 21:08 sbrugman

is there a branch for this or an implementation plan?

skorski avatar Aug 13 '20 16:08 skorski

@skorski There is currently no branch for this. There used to be a version of PP that was executing a Spark backend. That implementation used the pyspark.sql module to generate the statistics. Feel welcome to contribute if you'd like.

ps. (PNNL in) Richland rang a bell, turns out we passed it when visiting Gravity Hill

sbrugman avatar Aug 13 '20 18:08 sbrugman

hi all!

Have taken a stab at this @ https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan

The plan is still a WIP, but if you think it's a viable overall approach, I'd be happy to keep working on this! Also, please drop me an email at [email protected] if you would like write access to the doc; happy to collaborate on this with anyone!

chanedwin avatar Aug 23 '20 15:08 chanedwin

The working document for the implementation plan can be found here: https://github.com/pandas-profiling/pandas-profiling/wiki/Spark-Development-Plan. Contributions are welcome.

(Thanks to @chanedwin)

sbrugman avatar Aug 24 '20 18:08 sbrugman

@sbrugman The implementation plan is looking really good. I'll try to dig into it a bit and help where I can. Where is the best place to post questions?

skorski avatar Aug 25 '20 01:08 skorski

What about leveraging Koalas? This would be a huge shortcut.

ahmedanis03 avatar Sep 26 '20 15:09 ahmedanis03

@skorski The Slack community for pandas-profiling can be used for that: https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ

@ahmedanis03 Thank you for the suggestion, we're also considering koalas. The bulk of the work seems to be in refactoring and specific features more or less regardless of API (correlations, missing diagrams).

sbrugman avatar Sep 26 '20 15:09 sbrugman

There is already a project that was built around porting pandas-profiling to Spark: https://github.com/julioasotodv/spark-df-profiling I wonder if its code base could be of any help.

test32443 avatar Oct 08 '20 23:10 test32443

What about simply replacing pandas with Koalas (https://github.com/databricks/koalas)? Maybe it is worth a shot?

geoHeil avatar Oct 10 '20 07:10 geoHeil

I used to develop a big data profiling library based on Spark and also explored good open source solutions. Later, when I came across pandas-profiling, I gave up on the other solutions and have been quite happy with pandas-profiling. I have been using pandas-profiling to profile large production data too. The simple trick is to randomly sample data from the Spark cluster and bring it to one machine for profiling with pandas-profiling. Do we really need to profile the whole large dataset? For most non-extreme metrics, the answer is no. A 100K-row sample will likely give you accurate enough information about the population. For extreme metrics such as max, min, etc., I calculated them myself.

If pandas-profiling is going to support profiling large data, this might be the easiest but good-enough way. Just my 2 cents.
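
A minimal sketch of the sampling approach described above, not an official recipe. It assumes the package is importable as ydata_profiling (the current name shown in the page header) and that PySpark is installed. The helper names `sample_fraction` and `profile_spark_sample`, the `extreme_cols` parameter, and the `__min` / `__max` alias suffixes are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional, Sequence


def sample_fraction(total_rows: int, target_rows: int = 100_000) -> float:
    """Return the fraction of rows to keep so that about target_rows remain."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, target_rows / total_rows)


def profile_spark_sample(
    spark_df,
    extreme_cols: Sequence[str] = (),
    target_rows: int = 100_000,
    seed: Optional[int] = None,
):
    """Profile a random sample in pandas; compute min/max on the full data in Spark."""
    # Imported lazily so sample_fraction can be used without Spark installed.
    from pyspark.sql import functions as F
    from ydata_profiling import ProfileReport

    fraction = sample_fraction(spark_df.count(), target_rows)
    # sample() is approximate: the number of returned rows varies around the target.
    sample_pdf = spark_df.sample(
        withReplacement=False, fraction=fraction, seed=seed
    ).toPandas()

    # Min and max are sensitive to sampling, so they are computed on all rows.
    extremes: Dict[str, object] = {}
    if extreme_cols:
        aggs = []
        for c in extreme_cols:
            aggs += [F.min(c).alias(f"{c}__min"), F.max(c).alias(f"{c}__max")]
        extremes = spark_df.agg(*aggs).first().asDict()

    report = ProfileReport(sample_pdf, title="Profile of sampled data", minimal=True)
    return report, extremes
```

Calling `profile_spark_sample(df, extreme_cols=["price"])` would return the report for the sample together with a dict of exact extremes. The sampled report's own min and max should be read as approximate.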

dclong avatar Oct 11 '20 16:10 dclong

Is there a development branch for this if we want to contribute?

fanzhuyifan avatar Oct 14 '20 06:10 fanzhuyifan

Hey all, is there any ongoing development for this Spark backend? We're thinking of adopting this tool, but as our datasets grow we're eventually going to have to use Spark for our profiling.

ncoish avatar Nov 24 '20 20:11 ncoish

Hey @ncoish, yes it's ongoing ! @chanedwin has done the lion's share of the work needed for the Spark backend, which now needs to be integrated

sbrugman avatar Nov 24 '20 22:11 sbrugman

> Hey @ncoish, yes it's ongoing ! @chanedwin has done the lion's share of the work needed for the Spark backend, which now needs to be integrated

Few more weeks/months of wait?

os-datatools avatar Dec 17 '20 07:12 os-datatools

I would also like to know where this code lives so I can help out. One thing I would like to put forward for consideration: being able to pull out the underlying Spark SQL queries that form the profiles, so that people who only want that part can use it in Spark without pulling in all of pandas-profiling.

It would be nice to have that consistency, but otherwise I think it just makes more sense to build the analytical queries in Spark separately.
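
To illustrate what a standalone analytical query of this kind could look like, here is a rough sketch that builds a single Spark SQL statement for a few basic per-column statistics. This is not the query set used by pandas-profiling; the function name `build_profile_query`, the choice of statistics, and the alias naming are all assumptions made for the example. It also assumes the column names contain no backticks.

```python
from typing import Sequence


def build_profile_query(table: str, columns: Sequence[str]) -> str:
    """Build one Spark SQL statement computing basic statistics per column.

    Hypothetical helper: the statistics and alias names are illustrative only.
    """
    if not columns:
        raise ValueError("at least one column is required")

    parts = ["COUNT(*) AS n_rows"]
    for col in columns:
        quoted = f"`{col}`"  # backticks quote identifiers in Spark SQL
        parts += [
            f"COUNT({quoted}) AS `{col}__n_non_null`",
            f"COUNT(DISTINCT {quoted}) AS `{col}__n_distinct`",
            f"MIN({quoted}) AS `{col}__min`",
            f"MAX({quoted}) AS `{col}__max`",
        ]
    return f"SELECT {', '.join(parts)} FROM {table}"
```

With an active SparkSession and a registered table or view, the resulting string could be passed to `spark.sql(...)`, and the single result row read with `.first().asDict()`. Keeping the query builder free of Spark imports means it can be reused or inspected without the rest of a profiling package.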

kyprifog avatar Mar 01 '21 16:03 kyprifog

https://github.com/pandas-profiling/pandas-profiling/pull/670

sbrugman avatar Mar 01 '21 16:03 sbrugman

Hi, is this implemented?

jyotidhiman0610 avatar Oct 11 '21 05:10 jyotidhiman0610

Hi, how is this progressing? I see in the 'spark development plan' that the plan is to release early December 2021, is that still the expectation?

araker avatar Nov 11 '21 14:11 araker

Progress can be tracked on the GitHub project. An alpha version is planned for December 2021. In case anyone is interested in contributing, please reach out to @chanedwin (preferably via our Slack channel).

sbrugman avatar Nov 11 '21 14:11 sbrugman

hello! yup, echoing what Simon has said, hoping to get an alpha release out sometime in December (or early Jan at the latest, if things get messier than expected). Please feel free to join the Slack channel if you would like to get involved - always happy to discuss more there!

chanedwin avatar Nov 11 '21 17:11 chanedwin

Appreciate the effort in getting this to work with Spark. Any updates?

DataRx avatar Jan 21 '22 14:01 DataRx

Thanks for all of the work on this! Is this still in progress?

skorski avatar Apr 06 '22 15:04 skorski

I fear this endeavor has gone stale.

DataRx avatar Apr 08 '22 18:04 DataRx

I would like to be a beta tester of this feature :-)

daherk2 avatar Apr 28 '22 12:04 daherk2

Is someone still working on this feature?

prajal55 avatar May 22 '22 18:05 prajal55

Any idea when this would be released?

shriharimundada avatar May 23 '22 16:05 shriharimundada

No news so far?

Gexar avatar Jun 15 '22 07:06 Gexar

Looking forward to having this feature released. Also happy to be a beta tester and contribute if needed.

eyaldahari avatar Jul 26 '22 14:07 eyaldahari

Can we use Koalas as a "backend"?

EDIT: PySpark now has a built-in pandas-like interface (pyspark.pandas)

stormbeforesunsetbee avatar Nov 27 '22 08:11 stormbeforesunsetbee