Initialize the Profiling UI with an "external" json

Open alexlang74 opened this issue 2 years ago • 8 comments

Missing functionality

I'd like to be able to use the pandas profiling UI on a pre-existing profiling output that I have in json format. This avoids having to "re-run" profiling. There is the functionality to have the profiling report as HTML, and save that. However, if the profiling generation and the profiling visualization were further decoupled, it would allow me to pass in a json that I may have generated (or refreshed) by other means, maybe outside pandas_profiling. It's clear that in this case, I'm responsible for providing a proper json to have the visualization work properly...

Proposed feature

Allow the "cache" to be set from an existing json that follows a predefined schema.

Alternatives considered

No response

Additional context

No response

alexlang74 avatar Jan 25 '23 19:01 alexlang74

Can I work on this one? I am pretty new to this

jalajk24 avatar Jan 29 '23 20:01 jalajk24

@alexlang74 can you describe a bit more your scenario with maybe a minimal example in terms of interface?

Do you want to run the profiling without generating any output, then serialize the ProfileReport object to JSON so that you can deserialize it back into a ProfileReport later?

(btw, we might have worked together briefly when I was at IBM Krakow :D)

aquemy avatar Jan 30 '23 14:01 aquemy

Hi @aquemy , nice to hear from a former colleague.-)

I have the following in mind:

profile = ProfileReport(myDf)
jsonRes = profile.to_json()
...
newProfile = ProfileReport.from_json(jsonRes)  # proposed method, not available today
newProfile.to_notebook_iframe()

alexlang74 avatar Jan 30 '23 20:01 alexlang74

You wouldn't really do this within the same Python file / Notebook. It's as you said: One could serialize the json output, and pull it into a new Profile Report. There, one has then the flexibility to render it differently (as iFrame, as html,...). One could even use it to compare different versions of the data set over time, by keeping the json around, and then using the recently introduced comparison capabilities...
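To make the two-session idea concrete, here is a minimal standard-library sketch of a compute step that is decoupled from a render step through JSON. `MiniProfile` and all of its methods are hypothetical stand-ins invented for illustration; none of this is the ydata-profiling API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MiniProfile:
    """Hypothetical stand-in for a computed profile (not the library class)."""

    n_rows: int
    columns: dict  # column name -> summary statistics

    @classmethod
    def compute(cls, rows):
        """Expensive step: derive per-column statistics from the raw rows."""
        columns = {}
        for name in (rows[0] if rows else {}):
            values = [row[name] for row in rows if row[name] is not None]
            columns[name] = {
                "count": len(values),
                "missing": len(rows) - len(values),
                "distinct": len(set(values)),
            }
        return cls(n_rows=len(rows), columns=columns)

    def to_json(self):
        """Serialize the computed statistics only, not the data."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload):
        """Cheap step: rebuild the profile without touching the data."""
        data = json.loads(payload)
        return cls(n_rows=data["n_rows"], columns=data["columns"])

    def render(self):
        """Produce a plain-text view from the stored statistics."""
        lines = [f"rows: {self.n_rows}"]
        for name, stats in sorted(self.columns.items()):
            lines.append(
                f"{name}: {stats['distinct']} distinct, {stats['missing']} missing"
            )
        return "\n".join(lines)


rows = [{"city": "Krakow", "temp": 3}, {"city": "Krakow", "temp": None}]
saved = MiniProfile.compute(rows).to_json()  # session 1: compute and store
restored = MiniProfile.from_json(saved)      # session 2: load and render only
print(restored.render())
```

Because `render` only reads the stored statistics, the json could just as well have been produced by another tool, as long as it follows the same schema.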

alexlang74 avatar Jan 30 '23 20:01 alexlang74

@alexlang74 Thank you for the example.

We have a workaround for now if pickle is acceptable.

Serialization:

profile = ProfileReport(df)
profile.to_file('report.html')  # triggers the computation; alternatively, use profile.to_json() if you do not want a file output
profile.dump('my_report')  # serializes with pickle to my_report.pp

Deserialization:

loaded_profile = ProfileReport().load('my_report.pp')  # notice that you have to instantiate an empty instance of ProfileReport

loaded_profile will contain exactly the same information as the original object.
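As a rough illustration of why a pickle round trip preserves the information, the standard-library sketch below writes a computed result to a `.pp` file and reads it back. The `summary` dictionary is a made-up placeholder, and this is not how the library implements `dump` and `load` internally.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical computed result; a real report object holds far more than this.
summary = {"n_rows": 2, "columns": {"temp": {"missing": 1}}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_report.pp"
    path.write_bytes(pickle.dumps(summary))   # serialize once
    loaded = pickle.loads(path.read_bytes())  # later: load without recomputing

print(loaded == summary)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust. That is one more reason a plain json format would be preferable for exchanging profiles between tools.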

If you try to compare with the deserialized version, it will raise "ValueError: Reports where not initialized with a DataFrame." This happens because comparing requires at least the schema: we compare only the columns that are present in both datasets.

Another workaround for that would be to at least specify the columns:

loaded_profile.df = df.head(1)  # or an empty DataFrame with the same columns and proper dtypes
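To show why the comparison needs at least the column schema, here is a small standard-library sketch that compares two saved summaries on their shared columns only. The function `compare_shared_columns` and the dictionary layout are assumptions made for illustration, not the library's comparison logic.

```python
def compare_shared_columns(old, new):
    """Return the changed statistics for columns present in both summaries."""
    shared = sorted(old.keys() & new.keys())  # columns missing on either side are skipped
    diff = {}
    for name in shared:
        changed = {
            stat: (old[name][stat], new[name][stat])
            for stat in old[name].keys() & new[name].keys()
            if old[name][stat] != new[name][stat]
        }
        if changed:
            diff[name] = changed
    return diff


# Hypothetical per-column summaries from two versions of a dataset.
january = {
    "city": {"distinct": 10, "missing": 0},
    "temp": {"distinct": 31, "missing": 2},
}
february = {
    "temp": {"distinct": 28, "missing": 0},
    "wind": {"distinct": 5, "missing": 1},
}

changes = compare_shared_columns(january, february)
print(changes)
```

Without the column names of both sides, the intersection in the first line of the function cannot be computed, which mirrors why a loaded profile without a DataFrame cannot be compared.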

I hope it helps!

We surely should decouple the report computation from the report generation and allow for proper serialization.

aquemy avatar Jan 31 '23 08:01 aquemy

Thank you for the prompt response, I'll try that out!

> We surely should decouple the report computation from the report generation and allow for proper serialization.

Glad that you agree with my goal. This could also let other tools "contribute" to the computation, with ydata-profiling providing the UI experience.

alexlang74 avatar Jan 31 '23 09:01 alexlang74

> Can I work on this one? I am pretty new to this

Hi @jalajk24 ,

It is great that you want to contribute to the package. We have the roadmap open here, feel free to pick an item that is not already taken :)

Let me know if you need any support!

fabclmnt avatar Feb 01 '23 05:02 fabclmnt

This worked for me.

Can we compare 2 loaded profiles? I am getting the error below: ValueError: Reports where not initialized with a DataFrame.

capnomad avatar Jun 07 '23 19:06 capnomad