data-validation Get formatted schema and anomalies to visualize

I'm trying to run TFDV in a Kubeflow Pipeline and visualize the results in the pipeline UI.

For statistics, I can easily visualize them using get_statistics_html.
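For reference, here is a minimal sketch of how the statistics case can be wired into the pipeline UI. It assumes the Kubeflow Pipelines v1 output viewer and its "web-app" type with inline storage. The helper name build_web_app_metadata is hypothetical, and the HTML is a placeholder for the string that get_statistics_html returns.

```python
import json


def build_web_app_metadata(html: str) -> dict:
    """Wrap an HTML string in Kubeflow Pipelines UI metadata ("web-app" viewer)."""
    return {
        "outputs": [
            {
                "type": "web-app",
                "storage": "inline",
                "source": html,
            }
        ]
    }


# In a real component this string would be the return value of
# get_statistics_html(stats); a placeholder keeps the sketch self-contained.
html = "<h1>statistics placeholder</h1>"
metadata_json = json.dumps(build_web_app_metadata(html))
```

In a pipeline component, metadata_json would be written to /mlpipeline-ui-metadata.json, the same file used for the anomalies table.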

However, for schema and anomalies, I struggled. There are display_schema and display_anomalies functions, but they transform the data and call IPython display internally, so there is no way to get the formatted data for visualization. In the end, I mostly copied the display functions and changed them to return a DataFrame.

FYI, the code looks like this.

import json
import logging
from typing import Optional

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> Optional[pd.DataFrame]:
    # Build one row per feature-level anomaly, mirroring display_anomalies.
    anomaly_rows = []
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    # A dataset-level anomaly is not tied to a single feature.
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )

    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None

    logging.warning(f"{len(anomaly_rows)} anomalies found.")
    return pd.DataFrame(
        anomaly_rows,
        columns=[
            "Feature name",
            "Anomaly short description",
            "Anomaly long description",
        ],
    )


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)

    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        # Render the anomalies as an inline CSV table in the Kubeflow Pipelines UI.
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)

Does anyone know a better way? What do you think about separating the display functions into a transforming function and a visualizing function, like the one for statistics?
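To make the proposed split concrete, here is a rough sketch of what I have in mind. It is not TFDV's API. The function names anomalies_to_rows and rows_to_table_metadata are hypothetical, the quoting is a simplified stand-in for display_util._add_quotes, dataset-level anomalies are left out for brevity, and a SimpleNamespace stands in for the Anomalies proto so the snippet runs without TFDV installed.

```python
import csv
import io
from types import SimpleNamespace

HEADER = [
    "Feature name",
    "Anomaly short description",
    "Anomaly long description",
]


def anomalies_to_rows(anomalies) -> list:
    """Transform step: one row per feature anomaly, with no display side effects."""
    rows = []
    for feature_name, info in anomalies.anomaly_info.items():
        # Simplified quoting; the real helper is display_util._add_quotes.
        rows.append([f"'{feature_name}'", info.short_description, info.description])
    return rows


def rows_to_table_metadata(rows) -> dict:
    """Visualize step: render the rows as a Kubeflow Pipelines inline CSV table."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return {
        "outputs": [
            {
                "type": "table",
                "storage": "inline",
                "format": "csv",
                "header": HEADER,
                "source": buffer.getvalue(),
            }
        ]
    }


# Hypothetical stand-in for an Anomalies proto with a single feature anomaly.
fake_anomalies = SimpleNamespace(
    anomaly_info={
        "age": SimpleNamespace(
            short_description="Out-of-range values",
            description="Unexpectedly large value: 200.",
        )
    }
)
rows = anomalies_to_rows(fake_anomalies)
metadata = rows_to_table_metadata(rows)
```

With this shape, a notebook could pass the rows to IPython display, while a pipeline component could write the metadata to a file, and neither would need to copy the transformation logic.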

Nov 19 '20 08:11 wakanapo

What do you mean by "visualizable formatted" data?

The schema and stats are protocol buffer [1] objects. They implement __str__, so if you print() them, you'll get Protobuf Text Format [2], which is intended for humans to read. Internally at Google, our users review and modify the schema in text format.
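As a small illustration of the text format, here is a sketch that uses the Timestamp message bundled with the protobuf package as a stand-in for a schema proto (an assumption made so the snippet runs without TFDV installed). text_format.MessageToString produces the human-readable form and text_format.Parse reads it back.

```python
from google.protobuf import text_format, timestamp_pb2

# Stand-in message; a TFDV schema is a protocol buffer message handled the same way.
message = timestamp_pb2.Timestamp(seconds=1, nanos=5)

# Serialize to the human-readable text format.
text = text_format.MessageToString(message)
print(text)

# Parse the text back into a fresh message of the same type.
parsed = text_format.Parse(text, timestamp_pb2.Timestamp())
```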

[1] https://developers.google.com/protocol-buffers [2] https://googleapis.dev/python/protobuf/latest/google/protobuf/text_format.html (sorry, the spec for the TextFormat is not open-source).

Nov 23 '20 17:11 brills

Thanks, @brills, and sorry, my writing was unclear. By "visualizable formatted" data I just mean a table format, like the DataFrame created in display_schema or display_anomalies.

I want to visualize the result like this in Kubeflow Pipelines. [Screenshot 2020-11-20 16 02 51]

To do that, I want to get the DataFrame created in display_anomalies. Of course, I can create it myself the same way display_anomalies does, but reimplementing the same logic feels like a waste of time. So, for example, it would help me if display_anomalies returned the DataFrame.

Nov 24 '20 01:11 wakanapo

Thanks for the clarification.

We noted it in our internal bug tracker. What you suggested makes sense to me, but I'll first check with the Kubeflow team to understand what their UI is capable of displaying.

In the meantime, please keep using your "hack". As you can see, that piece of logic has been stable (and the part it extracts from the schema has also been stable).

Nov 24 '20 01:11 brills

I understand. Thanks!

Nov 24 '20 02:11 wakanapo

A vote of support for this feature.

I was trying to do exactly the same thing – DataFrames are much easier to work with than protos, especially for visualization in JS.

I also ended up copying the display_schema() and display_anomalies() code, but a proper library function would be great!

Dec 04 '20 06:12 kennysong