astronomer-providers

Implement usage metrics of Operators/Sensors

Open bharanidharan14 opened this issue 2 years ago • 23 comments

Add the module path and inheritance (sub-class) details to the log info emitted by scheduler_job.py, so that Astro Cloud can pick up the Splunk logs and show usage counts for each Operator and Sensor.

https://www.notion.so/astronomerio/Approach-to-find-usage-metrics-of-Operators-Sensors-947dd9d9968444fe984ee34ef6c4a420
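
A minimal sketch of what "module path and inheritance details" could look like in plain Python. The helper names (`class_path`, `direct_bases`, `format_operator_origin`), the `op_classpath=... parents=...` message layout, and the two stand-in classes are assumptions for illustration; none of this is taken from scheduler_job.py.

```python
import logging
from typing import List

log = logging.getLogger(__name__)


def class_path(cls: type) -> str:
    """Return the fully qualified 'module.ClassName' of a class."""
    return f"{cls.__module__}.{cls.__qualname__}"


def direct_bases(cls: type) -> List[str]:
    """Return the class paths of the immediate parent classes (excluding object)."""
    return [class_path(base) for base in cls.__bases__ if base is not object]


def format_operator_origin(operator: object) -> str:
    """Build a log message describing where an operator instance comes from.

    The message layout is hypothetical; a real log line would need to match
    whatever the downstream Splunk query expects.
    """
    cls = type(operator)
    return f"op_classpath={class_path(cls)} parents={','.join(direct_bases(cls))}"


# Hypothetical stand-ins for an Airflow sensor and a custom subclass.
class BaseSensor:
    pass


class MySensor(BaseSensor):
    pass


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log.info(format_operator_origin(MySensor()))
```

Logging both the class path and the direct parents keeps custom operators traceable back to the class they extend, which is what a usage count per Operator/Sensor would need.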

bharanidharan14 avatar Jul 12 '22 15:07 bharanidharan14

Created a PR in OSS Airflow to add op_classpath to the scheduler log, in order to track where the operator/sensor originated from.

#25309

bharanidharan14 avatar Jul 26 '22 13:07 bharanidharan14

Slack conversation on adding op_classpath in scheduler_job.py, where the logs are generated, so that Astro Cloud can pick up the info. A PR has been raised in OSS Airflow.

https://astronomer.slack.com/archives/C02PABPU6B0/p1658846250206209

bharanidharan14 avatar Jul 28 '22 07:07 bharanidharan14

As per the discussion on Slack, @steveyz-astro and @ashb confirmed that all worker logs get ingested into Splunk. Need to look into the metadata of the task instance log: if the task instance log has the class path, we can collate it to get the metrics; if it doesn't, we need to add the class path to the worker log.

bharanidharan14 avatar Jul 29 '22 06:07 bharanidharan14

Need to connect with @ashb regarding the log. Instead of making changes to OSS Airflow, need to try getting the log from Runtime.

bharanidharan14 avatar Aug 01 '22 06:08 bharanidharan14

Connected with @ashb and @chris. We decided to follow these steps to get the usage metrics without making any changes in OSS:

  1. Ask OL team if we can do an ad-hoc query against the DB
  2. Explore timeline for getting OL data into DW
  3. Change operator field in Airflow to be full classpath, not just class name (and work out impact on OSS code)
  4. Add new log line to Runtime worker output (note worker logs not task logs!)

bharanidharan14 avatar Aug 03 '22 04:08 bharanidharan14

Connected with @julienledem regarding running an ad-hoc query against the DB. He mentioned that we can query OpenLineage, but it will be per org, and the data has only the class name, not the class path (it can be added if needed). He also pointed me to the existing operator usage dashboard from @shillion's team: https://app.sigmacomputing.com/astronomer/workbook/Astro-Customer-Usage-Dashboard-2LB0JYkylKlxgJtXlRCdpU?:nodeId=3_lTDNxmJY

bharanidharan14 avatar Aug 03 '22 05:08 bharanidharan14

Got access to the DWH and DWH_DEV DBs from @chris.

bharanidharan14 avatar Aug 04 '22 04:08 bharanidharan14

As per the discussion in this thread, they currently don't have OpenLineage data in Astro GCS or the Snowflake DB. They have an RDS instance that holds all the individual org database instances, and they plan to get the data from there into Astro GCS and then into Snowflake. I am following up on that.

Once we get the data into the DWH Snowflake, we will be able to pick up the class path from OpenLineage.

bharanidharan14 avatar Aug 04 '22 04:08 bharanidharan14

@bharanidharan14 to list discussions and thoughts on https://www.notion.so/astronomerio/Approach-to-find-usage-metrics-of-Operators-Sensors-947dd9d9968444fe984ee34ef6c4a420 or decide if we need a new notion page.

pankajkoti avatar Aug 04 '22 06:08 pankajkoti

Connected with @Jed Cunningham over Slack. He suggested using the new listener feature for our case (https://airflow.apache.org/docs/apache-airflow/stable/listeners.html). So we would add a listener in Astro Runtime here (https://github.com/astronomer/astro-runtime/blob/main/package/astronomer/runtime/plugin.py) to log the class path, so that the logs get into Splunk as well.

Need to try what @jedcunningham suggested
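
A rough sketch of what such a listener could look like, assuming the Airflow 2.x listener hook `on_task_instance_success(previous_state, task_instance, session)` and assuming `task_instance.task` is populated when the hook fires. The `describe_task_instance` helper and the exact message fields are hypothetical and are not taken from the astro-runtime plugin. The `ImportError` fallback is only there so the sketch can be exercised without Airflow installed.

```python
import logging

try:
    from airflow.listeners import hookimpl
except ImportError:
    # Fallback so the sketch can be run without Airflow installed (assumption
    # for illustration only; a real plugin would import hookimpl directly).
    def hookimpl(func):
        return func

log = logging.getLogger(__name__)


def describe_task_instance(task_instance) -> str:
    """Build a log message with the operator class path of a task instance.

    Hypothetical helper: the field names and layout are illustrative.
    """
    task = getattr(task_instance, "task", None)
    if task is None:
        classpath = "unknown"
    else:
        cls = type(task)
        classpath = f"{cls.__module__}.{cls.__qualname__}"
    return (
        "TaskInstance Details: "
        f"dag_id={task_instance.dag_id} "
        f"task_id={task_instance.task_id} "
        f"classpath={classpath}"
    )


@hookimpl
def on_task_instance_success(previous_state, task_instance, session):
    """Log the class path whenever a task instance finishes successfully."""
    log.info(describe_task_instance(task_instance))
```

To take effect, the module holding the hook would be registered through the `listeners` attribute of an `AirflowPlugin` subclass. Because the message goes through a regular logger, it should end up wherever the component's logs are already shipped.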

bharanidharan14 avatar Aug 09 '22 04:08 bharanidharan14

Screenshot 2022-08-09 at 7.26.29 PM.png

I am able to get the class path locally, along with the task instance details, by adding a plugin and a listener.

bharanidharan14 avatar Aug 09 '22 14:08 bharanidharan14

Made the changes for the astro-runtime repo. I currently don't have write access to it, so I posted in the astro-runtime channel.

bharanidharan14 avatar Aug 10 '22 06:08 bharanidharan14

Raised a PR to add a listener to the astro-runtime repo. This listener gets the task instance details and logs them, so that the data team can capture the log into a table.

PR: https://github.com/astronomer/astro-runtime/pull/349

bharanidharan14 avatar Aug 11 '22 11:08 bharanidharan14

Requested @jedcunningham to review the PR and raised the same in the astro-runtime channel.

bharanidharan14 avatar Aug 17 '22 06:08 bharanidharan14

The PR got merged into the astro-runtime repo. Currently working on test cases and testing it in dev.

bharanidharan14 avatar Aug 19 '22 00:08 bharanidharan14

Asked @astronaut-chris about the timeline for getting this log into Splunk. This was his response, with the items he plans to work on:

  • A new runtime would have to be released, where your TaskInstance Details log messages would start being emitted
  • A Splunk query would have to be written/agreed upon, that could get the distinct TaskInstance Details log messages over a given period
  • We would add an ingestion task to our Splunk DAG
  • We would join these details into our task_runs table.
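
The "distinct messages over a given period" step can be sketched in Python, outside Splunk. The log line layout matched by `PATTERN` is a hypothetical format made up for this example, and `count_operator_usage` is an illustrative helper, not part of the data team's pipeline.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical line format; the real TaskInstance Details message may differ.
PATTERN = re.compile(
    r"TaskInstance Details: "
    r"dag_id=(?P<dag_id>\S+) task_id=(?P<task_id>\S+) classpath=(?P<classpath>\S+)"
)


def count_operator_usage(lines: Iterable[str]) -> Counter:
    """Count distinct (dag_id, task_id) pairs per operator class path.

    Repeated runs of the same task produce identical details, so they are
    deduplicated before counting.
    """
    seen = set()
    usage = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a TaskInstance Details line
        key = (match["dag_id"], match["task_id"], match["classpath"])
        if key in seen:
            continue
        seen.add(key)
        usage[match["classpath"]] += 1
    return usage
```

Deduplicating on the task identity rather than on the raw line keeps timestamps or log prefixes from inflating the counts.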

Screen Shot 2022-08-18 at 7.19.08 AM.png

bharanidharan14 avatar Aug 19 '22 00:08 bharanidharan14

@bharanidharan14 What's the latest on this ticket? Can you ensure that this is updated as soon as we progress on the task?

phanikumv avatar Sep 07 '22 10:09 phanikumv

Connected with @astronaut-chris regarding the task. Once the latest astro-runtime image is released, he should be able to pick up the log and start on ticket DT-587.

bharanidharan14 avatar Sep 14 '22 06:09 bharanidharan14

Data team ticket : https://astronomer.atlassian.net/browse/DT-587

bharanidharan14 avatar Sep 16 '22 10:09 bharanidharan14

Waiting on the changes by Data team.

phanikumv avatar Sep 26 '22 11:09 phanikumv

Hello! Even though it appears that there are many deployments that have adopted the 6.x runtime, it looks like we're only picking up events in dev. Please see below.

image

For context, we pick up TaskInstance Finished logs from the prod-schedulers index.

Can you dig into why this is not exporting to the correct index?

ghost avatar Sep 26 '22 11:09 ghost

Connected with @astronaut-chris. As an early analysis, we tried to get some information out of @rossturk's POC lineage data in the sandbox. By connecting to the table, we are able to get the whole class path, the most-used operators and their parent classes, and the operator names.

I have added the query, and the graph we got by running it on the POC lineage data, to https://www.notion.so/astronomerio/Approach-to-find-usage-metrics-of-Operators-Sensors-947dd9d9968444fe984ee34ef6c4a420

https://github.com/astronomer/analyses/blob/main/analyses/needed-async-operators/needed-async-operators.R

bharanidharan14 avatar Sep 28 '22 07:09 bharanidharan14

Connected with @uranusjr regarding how to get the first-level inheritance of a custom operator. Based on the discussion, he suggested the following steps:

  • Try adding a field to the task instance, before the operator is serialized, that holds the inheritance chain modules as a string and is captured in the task instance table.
  • Try this as a POC and check whether this info appears in the TaskInstance table and whether the data is reflected in Splunk.
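
A small sketch of how an inheritance chain could be flattened into a string, reading "first-level inheritance" as the nearest ancestor of a custom operator that lives in an `airflow.` module. The function names, the `" -> "` separator, the `airflow.` prefix check, and the two stand-in classes (with hand-set `__module__` values) are all assumptions for illustration, not the approach taken in the PR.

```python
from typing import Optional, Tuple


def inheritance_chain(cls: type) -> str:
    """Join the class paths along the MRO into one string, skipping object."""
    return " -> ".join(
        f"{c.__module__}.{c.__qualname__}" for c in cls.__mro__ if c is not object
    )


def first_provider_ancestor(
    cls: type, prefixes: Tuple[str, ...] = ("airflow.",)
) -> Optional[str]:
    """Return the nearest ancestor whose module starts with one of the prefixes."""
    for ancestor in cls.__mro__[1:]:  # skip the class itself
        if ancestor.__module__.startswith(prefixes):
            return f"{ancestor.__module__}.{ancestor.__qualname__}"
    return None


# Hypothetical stand-ins; __module__ is set by hand to mimic real packages.
class ExampleSensor:
    __module__ = "airflow.providers.example.sensors"


class CustomSensor(ExampleSensor):
    __module__ = "my_company.sensors"


if __name__ == "__main__":
    print(inheritance_chain(CustomSensor))
    print(first_provider_ancestor(CustomSensor))
```

Storing the chain as a single string keeps it serializable, which matters if the value has to survive operator serialization and land in a database column.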

bharanidharan14 avatar Oct 19 '22 06:10 bharanidharan14

Created PR: https://github.com/astronomer/astro-runtime/pull/524

bharanidharan14 avatar Nov 30 '22 07:11 bharanidharan14