cwl-airflow

provenance

Open mdrio opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe. Hi, I need to track the provenance of artifacts produced by workflows.

Describe the solution you'd like The workflow report contains only information about the output. It would be great to have it also reference the related CWL file and its inputs. Is there an easy way to obtain that?

Thanks

mdrio avatar Feb 17 '21 13:02 mdrio

Hi @mdrio, sorry for the late reply. All the information about the workflow execution is stored in the Airflow metadata database, which you can access from the Airflow UI or directly. We use XCom to store the location of the JSON file with the outputs of each step. Also, if you add the following section to your airflow.cfg file

[cwl]
keep_tmp_data = true

the system won't delete any temporary data between the steps. In Airflow terms, a workflow is identified by its DAG ID; we assume that you add a separate DAG for each new workflow. Each specific workflow execution is a DagRun. For each DagRun you can get the parameters it was triggered with; they are available in the Airflow UI as the configuration of the DagRun. We also report workflow execution statistics, such as the time spent in each step and the disk usage for temporary and output files. Let me know if you need any additional information about it.
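As a rough illustration of how the DAG ID, the DagRun configuration, and the step outputs could be combined into one provenance record, here is a minimal sketch. It assumes Airflow 2.x with the stable REST API enabled, where a DagRun is exposed at /api/v1/dags/{dag_id}/dagRuns/{dag_run_id} and its JSON includes the dag_id, dag_run_id, state and conf fields. The helper names and the shape of the resulting record are hypothetical and not part of CWL-Airflow.

```python
from urllib.parse import quote


def dag_run_url(base_url, dag_id, dag_run_id):
    """Build the Airflow 2.x stable REST API URL for a single DagRun.

    base_url is the address of the Airflow webserver (an assumption here,
    e.g. http://localhost:8080).
    """
    return "{}/api/v1/dags/{}/dagRuns/{}".format(
        base_url.rstrip("/"),
        quote(dag_id, safe=""),
        quote(dag_run_id, safe=""),
    )


def provenance_record(dag_run, outputs):
    """Combine a DagRun JSON object with the workflow outputs.

    dag_run: dict as returned by the DagRun endpoint above.
    outputs: whatever the workflow reported as its results (opaque here).
    The record layout is a hypothetical example, not a CWL-Airflow format.
    """
    return {
        "workflow": dag_run["dag_id"],        # which DAG (workflow) was run
        "run": dag_run["dag_run_id"],         # which execution of it
        "inputs": dag_run.get("conf") or {},  # parameters it was triggered with
        "state": dag_run.get("state"),
        "outputs": outputs,
    }
```

The DagRun JSON itself would be fetched from the URL with any HTTP client and the authentication configured for your Airflow instance; that part is left out because it depends on the deployment.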

michael-kotliar avatar Mar 15 '21 14:03 michael-kotliar

Hi @michael-kotliar, thanks for the reply. How does the statistics gathering work? I see that a connection has to be created in order to receive the data, but which service or API is expected to be called? Is the workflow report also included in the POSTed data?

mdrio avatar May 11 '21 08:05 mdrio

Hi @mdrio, we POST all collected statistics as part of the progress report. The POST is triggered from the on_success/on_failure callbacks of the task or DAG, so you don't need to have the CWL-Airflow API running.

The endpoints are defined as

CONN_ID = "process_report"
ROUTES = {
    "progress": "airflow/progress",
    "results":  "airflow/results",
    "status":   "airflow/status"
}

where the process_report connection should be created in Airflow.

Please see more details here: https://cwl-airflow.readthedocs.io/en/latest/readme/how_to_use.html#posting-pipeline-execution-progress-statistics-and-results
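For anyone wondering what the receiving side could look like: any HTTP service that accepts POST requests on the routes above should work as the target of the process_report connection. Below is a minimal sketch using only the Python standard library. The payload schema is not described in this thread, so the sketch treats each body as opaque JSON; the names ReportHandler, RECEIVED and start_receiver are hypothetical, and the host and port are assumptions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Routes as listed above; each one collects the payloads POSTed to it.
ROUTES = {
    "progress": "airflow/progress",
    "results": "airflow/results",
    "status": "airflow/status",
}
RECEIVED = {name: [] for name in ROUTES}


class ReportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Match the request path (query string ignored) against known routes.
        path = self.path.split("?", 1)[0].strip("/")
        name = next((n for n, r in ROUTES.items() if r == path), None)
        if name is None:
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body or b"null")
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        # Store the payload before replying, so it is visible once the
        # client sees the 200 response.
        RECEIVED[name].append(payload)
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_receiver(host="127.0.0.1", port=0):
    """Start the receiver in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), ReportHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The process_report connection in Airflow would then point at the host and port of this service. Persisting RECEIVED["results"] together with the DagRun configuration is one possible way to keep outputs and inputs side by side for provenance.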

Let me know if it helps.

michael-kotliar avatar Jun 01 '21 17:06 michael-kotliar