Get formatted schema and anomalies to visualize
I'm trying to run TFDV in a Kubeflow pipeline and visualize the results in the pipeline UI.
For statistics, I can easily do this with get_statistics_html.
For the schema and anomalies, however, I struggled. There are display_schema and display_anomalies functions, but each one transforms the data and calls IPython display internally, so there is no way to get the formatted data back for visualization.
In the end, I mostly copied the display functions and changed them to return a DataFrame.
FYI, the code looks like this.
import json
import logging

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> pd.DataFrame:
    # Same logic as display_anomalies, but returns the DataFrame instead of displaying it.
    anomaly_rows = []
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    # Dataset-level anomalies are not tied to a single feature.
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )
    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None
    else:
        logging.warning(f"{len(anomaly_rows)} anomalies found.")
        anomalies_df = pd.DataFrame(
            anomaly_rows,
            columns=[
                "Feature name",
                "Anomaly short description",
                "Anomaly long description",
            ],
        )
        return anomalies_df


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)
    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        # Inline CSV table for the Kubeflow Pipelines output viewer.
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)
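The table metadata itself does not need pandas. As a minimal sketch, assuming the anomaly rows are already available as plain lists of strings, the same inline-CSV structure can be built with the standard library. The helper name build_table_metadata and the sample row are hypothetical and only for illustration; they are not part of TFDV or Kubeflow.

```python
import csv
import io
import json


def build_table_metadata(header, rows):
    """Build an inline-CSV "table" viewer spec from plain rows (hypothetical helper)."""
    buf = io.StringIO()
    # csv.writer quotes cells containing commas or newlines, which long
    # anomaly descriptions often do.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return {
        "outputs": [
            {
                "type": "table",
                "storage": "inline",
                "format": "csv",
                "header": list(header),
                "source": buf.getvalue(),
            }
        ]
    }


# Made-up sample row, only to exercise the helper.
header = ["Feature name", "Anomaly short description", "Anomaly long description"]
rows = [["'age'", "Out-of-range values", "Unexpectedly small value: -1, expected >= 0."]]
metadata = build_table_metadata(header, rows)
print(json.dumps(metadata, indent=2))
```

Keeping this step separate from the TFDV-specific transformation means the same helper could serve schema rows as well as anomaly rows.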
Does anyone know a better way? What do you think about splitting each display function into a transforming function and a visualizing function, as is already done for statistics?
What do you mean by "visualizable formatted" data?
The schema and stats are protocol buffer [1] objects. They implement __str__, so if you print() them you get the Protobuf Text Format [2], which is intended for humans to read. Internally at Google, our users review and modify the schema in text format.
[1] https://developers.google.com/protocol-buffers
[2] https://googleapis.dev/python/protobuf/latest/google/protobuf/text_format.html (sorry, the spec for the TextFormat is not open-source).
Thanks, @brills, and sorry, my wording was unclear. By "visualizable formatted" data I just mean a table format, like the DataFrame created in display_schema or display_anomalies.
I want to visualize the result like this in Kubeflow Pipelines.
For that, I want to get the DataFrame created in display_anomalies. Of course, I can build it myself the same way display_anomalies does, but reimplementing the same logic feels like a waste of time. So, for example, it would help me if display_anomalies returned the DataFrame.
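To make the request concrete, here is a minimal sketch of the split being proposed: one function that only transforms the data into rows, and one that only renders them. It uses plain Python structures rather than TFDV protos, and the names anomalies_to_rows and rows_to_html, as well as the input mapping, are hypothetical; this is not how TFDV implements display_anomalies.

```python
import html


def anomalies_to_rows(anomalies_by_feature):
    """Transform step: map feature -> (short, long) description into table rows."""
    return [
        [name, short, long_desc]
        for name, (short, long_desc) in sorted(anomalies_by_feature.items())
    ]


def rows_to_html(header, rows):
    """Render step: turn plain rows into an HTML table string, escaping cell text."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Made-up input, only to show the two steps being used independently.
sample = {"age": ("Out-of-range values", "Unexpectedly small value < 0.")}
rows = anomalies_to_rows(sample)
table_html = rows_to_html(["Feature name", "Short", "Long"], rows)
print(table_html)
```

With this shape, a notebook user can call the render step, while a pipeline step can take the rows and feed them to whatever viewer the UI supports.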
Thanks for the clarification.
We have noted it in our internal bug tracker. What you suggested makes sense to me, but I'll first check with the Kubeflow team to understand what their UI is capable of displaying.
In the meantime, please keep using your "hack". As you can see, that piece of logic has been stable (and so has the part it extracts from the schema).
I understand. Thanks!
A vote of support for this feature.
I was trying to do exactly the same thing – DataFrames are much easier to work with than protos, especially for visualization in JS.
I also ended up copying the display_schema() and display_anomalies() code, but a proper library function would be great!