Resource template logs for spark application don't get archived in artifact repo

Open Freia3 opened this issue 1 year ago • 7 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I have an Argo Workflow that runs a Spark application (using the spark-operator) through a resource template. I want to archive the logs of this workflow in an artifact repository, but they are not archived.

When I run the hello-world workflow, the logs do get archived. YAML files to reproduce: https://github.com/Freia3/argo-spark-example
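
For context, log archiving is usually switched on in the artifact repository configuration. The following is a minimal sketch of such a configuration, not the reporter's actual setup (that is in the linked repository). It assumes an S3-compatible store; the bucket, endpoint, and secret names are hypothetical placeholders.

```yaml
# Sketch of a workflow-controller ConfigMap that enables log archiving.
# Assumption: S3-compatible storage; all names below are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  artifactRepository: |
    # Archive the main container logs of workflow pods as artifacts
    archiveLogs: true
    s3:
      bucket: my-bucket            # hypothetical bucket name
      endpoint: minio:9000         # hypothetical endpoint
      insecure: true
      accessKeySecret:
        name: my-minio-cred        # hypothetical Secret holding the credentials
        key: accesskey
      secretKeySecret:
        name: my-minio-cred
        key: secretkey
```

With a configuration along these lines, the hello-world workflow's logs land in the repository, while the resource template above produces no archived log.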

Version

v3.4.2

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-kubernetes-dag
  namespace: freia
spec:
  entrypoint: sparkling-operator
  serviceAccountName: argo-spark
  templates:
  - name: sparkpi
    resource: 
      action: create 
      successCondition: status.applicationState.state in (COMPLETED)
      failureCondition: 'status.applicationState.state in (FAILED, SUBMISSION_FAILED, UNKNOWN)'
      manifest: | 
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          generateName: spark-pi
          namespace: freia
        spec:
          type: Scala
          mode: cluster
          image: "gcr.io/spark-operator/spark:v3.0.0"
          imagePullPolicy: Always
          mainClass: org.apache.spark.examples.SparkPi
          mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
          sparkVersion: "3.0.0"
          restartPolicy:
            type: Never
          driver:
            memory: "512m"
            labels:
              version: 3.0.0
            serviceAccount: my-release-spark
          executor:
            instances: 1
            memory: "512m"
            labels:
              version: 3.0.0
  - name: sparkling-operator
    dag:
      tasks:
      - name: SparkPi1
        template: sparkpi

Logs from the workflow controller

time="2022-10-24T14:23:06.533Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Updated phase  -> Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="DAG node spark-kubernetes-dagbkjgn initialized Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="All of node spark-kubernetes-dagbkjgn.SparkPi1 dependencies [] completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Pod node spark-kubernetes-dagbkjgn-3694106157 initialized Pending" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.601Z" level=info msg="Created pod: spark-kubernetes-dagbkjgn.SparkPi1 (spark-kubernetes-dagbkjgn-sparkpi-3694106157)" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.616Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138244 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="node changed" namespace=freia new.message= new.phase=Running new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Pending old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.634Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138341 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="node unchanged" namespace=freia nodeID=spark-kubernetes-dagbkjgn-3694106157 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node changed" namespace=freia new.message= new.phase=Succeeded new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Running old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Outbound nodes of spark-kubernetes-dagbkjgn set to [spark-kubernetes-dagbkjgn-3694106157]" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn finished: 2022-10-24 14:27:02.049826737 +0000 UTC" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of spark-kubernetes-dagbkjgn" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Updated phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Marking workflow completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of " namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.055Z" level=info msg="cleaning up pod" action=deletePod key=freia/spark-kubernetes-dagbkjgn-1340600742-agent/deletePod
time="2022-10-24T14:27:02.067Z" level=info msg="Workflow update successful" namespace=freia phase=Succeeded resourceVersion=27139900 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.078Z" level=info msg="cleaning up pod" action=labelPodCompleted key=freia/spark-kubernetes-dagbkjgn-sparkpi-3694106157/labelPodCompleted

Logs from your workflow's wait container

No resources found in argo namespace.

Freia3, Oct 24 '22 14:10

Your resource needs some metadata; take a look at this example:

https://github.com/argoproj/argo-workflows/blob/master/examples/k8s-resource-log-selector.yaml
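
As a rough sketch of what that example suggests, adapted here to the SparkApplication from the issue (an adaptation, not copied from the linked file): the pods created by the custom resource carry the `workflows.argoproj.io/workflow` label so that Argo can associate their logs with the workflow. This assumes the spark-operator applies `driver.labels` and `executor.labels` to the pods it creates.

```yaml
# Fragment of the SparkApplication manifest inside the resource template.
# Assumption: the operator copies these labels onto the driver and executor pods.
driver:
  memory: "512m"
  labels:
    version: 3.0.0
    # Ties the driver pod to the workflow so its logs can be selected
    workflows.argoproj.io/workflow: "{{workflow.name}}"
  serviceAccount: my-release-spark
executor:
  instances: 1
  memory: "512m"
  labels:
    version: 3.0.0
    # Same label on the executor pods
    workflows.argoproj.io/workflow: "{{workflow.name}}"
```

This affects which pod logs Argo can display; it does not by itself archive them.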

ajkaanbal, Oct 24 '22 14:10

@ajkaanbal That example is for pulling the logs from the pods created by the Spark CRD (spark-driver, spark-executor). In the Argo UI I can already see these logs: [image] What I want is to be able to archive those logs.

Freia3, Oct 24 '22 15:10

@Freia3 The resource template currently does not support archiving logs. Would you like to work on this enhancement?

sarabala1979, Oct 31 '22 17:10

@sarabala1979 OK, thanks for the information; I couldn't find this in the docs. No, I can't work on this enhancement.

Freia3, Oct 31 '22 21:10

@sarabala1979 Hello, I'd like to contribute to this one. It is relevant for my team, and I believe the fix is straightforward.

From what I've seen, there are two ways to solve it:

  1. Make the argoexec resource store the logs as an artifact using executor.WorkflowExecutor.SaveLogs - I still need to check whether it can store its own container's logs while it is still running
  2. Start a wait container for resource pods, which by design stores the logs of the main container here - I worry about redundant error reporting, but I think it is safe

tbh I think 2 is the better option, wdyt?

arnoin, Nov 15 '23 12:11

I'm probably in agreement about 2 being the right way to do it. @sarabala1979, can you pitch in?

Joibel, Nov 22 '23 09:11

Hey @sarabala1979, do you want me to create a pull request?

arnoin, Dec 31 '23 13:12