
update pattern for dataflow job id extraction

Open lukas-mi opened this issue 6 months ago • 0 comments

The Dataflow job id is extracted from the logged output of the Java process that starts the Dataflow job, for example when using BeamRunJavaPipelineOperator.

Currently, the job id pattern matches characters until the first " or \n is encountered, which is fine for the following case:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151

However, if the logger is configured differently, for example so that the job id is followed by whitespace and a suffix with additional information, the pattern extracts the id together with the suffix:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)

In the previous example, the suffix (org.apache.beam.runners.dataflow.DataflowRunner) (main) should not be extracted as part of the job id.

I updated the pattern by adding the whitespace character class \s (alongside the existing " and \n) to the set of characters that mark the end of the job id.
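
To illustrate the difference, here is a minimal sketch in Python. The two patterns below are simplified, hypothetical stand-ins for the old and the updated behaviour; the names OLD_PATTERN, NEW_PATTERN and extract_job_id are assumptions made for this example, and the actual pattern in the provider may contain more alternatives.

```python
import re

# Hypothetical, simplified stand-ins for the pattern before and after the change.
# Old behaviour: the id ends at the first double quote or newline.
OLD_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n]*)')
# New behaviour: any whitespace character also ends the id.
NEW_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n\s]*)')


def extract_job_id(pattern, line):
    """Return the captured job id, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("job_id") if match else None


# The two logged lines from the description above.
plain_line = (
    "[2024-08-27 11:20:22,094] INFO Submitted job: "
    "2024-08-27_04_20_21-7947372725816706151"
)
suffixed_line = plain_line + " (org.apache.beam.runners.dataflow.DataflowRunner) (main)"

for name, pattern in (("old", OLD_PATTERN), ("new", NEW_PATTERN)):
    print(name, "| plain:   ", extract_job_id(pattern, plain_line))
    print(name, "| suffixed:", extract_job_id(pattern, suffixed_line))
```

With the old pattern the suffixed line yields the id together with the suffix, while the new pattern returns only the id for both lines.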



lukas-mi, Aug 27 '24 14:08