ydata-profiling
Initialize the Profiling UI with an "external" json
Missing functionality
I'd like to be able to use the pandas profiling UI on a pre-existing profiling output that I have in JSON format, which avoids having to re-run profiling. It is already possible to render the profiling report as HTML and save that. However, if profile generation and profile visualization were further decoupled, I could pass in a JSON that I may have generated (or refreshed) by other means, maybe outside pandas_profiling. Clearly, in this case I'm responsible for providing a proper JSON so that the visualization works properly...
Proposed feature
Allow setting the "cache" from an existing JSON that follows a predefined schema.
Alternatives considered
No response
Additional context
No response
Can I work on this one? I am pretty new to this
@alexlang74 can you describe a bit more your scenario with maybe a minimal example in terms of interface?
Do you want to run the profiling without generating any output, then serialize the ProfileReport object to JSON so that you can deserialize it back into a ProfileReport later?
(btw, we might have worked together briefly when I was at IBM Krakow :D)
Hi @aquemy, nice to hear from a former colleague :-)
I have the following in mind:
profile = ProfileReport(myDf)
jsonRes = profile.to_json()
...
newProfile = ProfileReport.from_json(jsonRes)
newProfile.to_notebook_iframe()
You wouldn't really do this within the same Python file / notebook. It's as you said: one could serialize the JSON output and pull it into a new ProfileReport later. At that point, one has the flexibility to render it differently (as an iframe, as HTML, ...). One could even use it to compare different versions of the data set over time, by keeping the JSON around and then using the recently introduced comparison capabilities...
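To make the idea concrete without touching the library, here is a minimal, standard-library-only sketch of the separation I have in mind. It does not use ydata-profiling at all: compute_summary and render_text are hypothetical names made up for illustration, and the JSON layout is an assumption, not the schema produced by to_json().

```python
import json
from statistics import mean


def compute_summary(rows):
    """Computation step: derive per-column statistics from a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for name in columns:
        # Treat absent keys and explicit None values as missing.
        values = [row[name] for row in rows if row.get(name) is not None]
        summary[name] = {
            "count": len(values),
            "missing": len(rows) - len(values),
            "mean": mean(values) if values else None,
        }
    return summary


def render_text(summary):
    """Rendering step: needs only the summary, never the original rows."""
    lines = []
    for name, stats in summary.items():
        lines.append(
            f"{name}: count={stats['count']} "
            f"missing={stats['missing']} mean={stats['mean']}"
        )
    return "\n".join(lines)


rows = [{"age": 30, "height": 170}, {"age": 40, "height": None}]

# Session 1: compute once and keep the JSON around (e.g. write it to disk).
payload = json.dumps(compute_summary(rows))

# Session 2: load the JSON and render it, without access to the data.
restored = json.loads(payload)
report = render_text(restored)
print(report)
```

The point is only that the renderer takes the deserialized summary as its sole input, so the JSON could just as well come from another tool.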
@alexlang74 Thank you for the example.
We have a workaround for now if pickle is acceptable.
Serialization:
from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file('report.html')  # Triggers the computation; alternatively, use profile.to_json() for no file output
profile.dump('my_report')  # Serializes with pickle to my_report.pp
Deserialization:
loaded_profile = ProfileReport().load('my_report.pp') # notice that you have to instantiate an empty instance of ProfileReport
loaded_profile will contain exactly the same information as the original object.
If you try to compare using the deserialized version, it will raise ValueError: Reports where not initialized with a DataFrame. This is because comparing requires at least the schema: only the columns that are present in both datasets are compared.
Another workaround for that would be to at least specify the columns:
loaded_profile.df = df.head(1) # or empty but with the columns + proper dtypes
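To illustrate why the schema alone is enough for that step, here is a rough sketch of selecting the columns present in both datasets. It is plain Python and not the library's actual code: shared_columns and the two example schemas are made up for illustration.

```python
def shared_columns(left_schema, right_schema):
    """Return the column names present in both schemas, in the left schema's order."""
    return [name for name in left_schema if name in right_schema]


# Hypothetical schemas (column name -> dtype) of two versions of a dataset.
old_schema = {"id": "int64", "price": "float64", "region": "object"}
new_schema = {"id": "int64", "price": "float64", "channel": "object"}

# Only these columns would take part in a comparison.
common = shared_columns(old_schema, new_schema)
print(common)
```

No row data is needed here, which is why a one-row or empty DataFrame with the right columns is sufficient for the workaround.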
I hope it helps!
We surely should decouple the report computation from the report generation and allow for proper serialization.
Thank you for the prompt response, I'll try that out!
We surely should decouple the report computation from the report generation and allow for proper serialization.
Glad that you agree with my goal. This could also help other tools "contribute" to the computation, with ydata-profiling serving as the UI experience.
Can I work on this one? I am pretty new to this
Hi @jalajk24 ,
it is great that you want to contribute to the package. We have the roadmap open here, feel free to pick one that is not already taken :)
Let me know if you need any support!
This worked for me.
Can we compare 2 loaded profiles?
I am getting the error below:
ValueError: Reports where not initialized with a DataFrame.