argo-workflows
Resource template logs for spark application don't get archived in artifact repo
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
I have an Argo Workflow running a Spark application (using the spark-operator). I want to archive the logs of this workflow in an artifact repository, but this does not work.
When running the hello-world workflow, the logs do get archived. YAML files to reproduce: https://github.com/Freia3/argo-spark-example
Version
v3.4.2
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-kubernetes-dag
  namespace: freia
spec:
  entrypoint: sparkling-operator
  serviceAccountName: argo-spark
  templates:
  - name: sparkpi
    resource:
      action: create
      successCondition: status.applicationState.state in (COMPLETED)
      failureCondition: 'status.applicationState.state in (FAILED, SUBMISSION_FAILED, UNKNOWN)'
      manifest: |
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          generateName: spark-pi
          namespace: freia
        spec:
          type: Scala
          mode: cluster
          image: "gcr.io/spark-operator/spark:v3.0.0"
          imagePullPolicy: Always
          mainClass: org.apache.spark.examples.SparkPi
          mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
          sparkVersion: "3.0.0"
          restartPolicy:
            type: Never
          driver:
            memory: "512m"
            labels:
              version: 3.0.0
            serviceAccount: my-release-spark
          executor:
            instances: 1
            memory: "512m"
            labels:
              version: 3.0.0
  - name: sparkling-operator
    dag:
      tasks:
      - name: SparkPi1
        template: sparkpi
Logs from the workflow controller
time="2022-10-24T14:23:06.533Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Updated phase -> Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="DAG node spark-kubernetes-dagbkjgn initialized Running" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="All of node spark-kubernetes-dagbkjgn.SparkPi1 dependencies [] completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.540Z" level=info msg="Pod node spark-kubernetes-dagbkjgn-3694106157 initialized Pending" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.601Z" level=info msg="Created pod: spark-kubernetes-dagbkjgn.SparkPi1 (spark-kubernetes-dagbkjgn-sparkpi-3694106157)" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.602Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:06.616Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138244 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.603Z" level=info msg="node changed" namespace=freia new.message= new.phase=Running new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Pending old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.604Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:16.634Z" level=info msg="Workflow update successful" namespace=freia phase=Running resourceVersion=27138341 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="node unchanged" namespace=freia nodeID=spark-kubernetes-dagbkjgn-3694106157 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:23:26.635Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Processing workflow" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Task-result reconciliation" namespace=freia numObjs=0 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node changed" namespace=freia new.message= new.phase=Succeeded new.progress=0/1 nodeID=spark-kubernetes-dagbkjgn-3694106157 old.message= old.phase=Running old.progress=0/1 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Outbound nodes of spark-kubernetes-dagbkjgn set to [spark-kubernetes-dagbkjgn-3694106157]" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="node spark-kubernetes-dagbkjgn finished: 2022-10-24 14:27:02.049826737 +0000 UTC" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of spark-kubernetes-dagbkjgn" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="TaskSet Reconciliation" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg=reconcileAgentPod namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Updated phase Running -> Succeeded" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Marking workflow completed" namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.049Z" level=info msg="Checking daemoned children of " namespace=freia workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.055Z" level=info msg="cleaning up pod" action=deletePod key=freia/spark-kubernetes-dagbkjgn-1340600742-agent/deletePod
time="2022-10-24T14:27:02.067Z" level=info msg="Workflow update successful" namespace=freia phase=Succeeded resourceVersion=27139900 workflow=spark-kubernetes-dagbkjgn
time="2022-10-24T14:27:02.078Z" level=info msg="cleaning up pod" action=labelPodCompleted key=freia/spark-kubernetes-dagbkjgn-sparkpi-3694106157/labelPodCompleted
Logs from your workflow's wait container
No resources found in argo namespace.
Your resource needs some metadata, take a look at the example:
https://github.com/argoproj/argo-workflows/blob/master/examples/k8s-resource-log-selector.yaml
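For illustration, the linked example labels the pods created by the custom resource with `workflows.argoproj.io/workflow` so that Argo can find their logs. A minimal sketch of how that might look for the SparkApplication in this issue follows. It assumes the spark-operator copies `driver.labels` and `executor.labels` onto the pods it creates; the label placement is an adaptation of the linked example, not something confirmed in this thread. As the rest of the thread indicates, this only makes the pod logs visible through Argo; it does not make them archived.

```yaml
# Fragment of the resource template's manifest (sketch, not a confirmed fix).
manifest: |
  apiVersion: "sparkoperator.k8s.io/v1beta2"
  kind: SparkApplication
  metadata:
    generateName: spark-pi
    namespace: freia
  spec:
    # ... remaining fields unchanged from the workflow in the issue ...
    driver:
      memory: "512m"
      labels:
        version: 3.0.0
        # Assumption: this label lets Argo associate the driver pod with the workflow.
        workflows.argoproj.io/workflow: "{{workflow.name}}"
      serviceAccount: my-release-spark
    executor:
      instances: 1
      memory: "512m"
      labels:
        version: 3.0.0
        # Same label on the executor pods.
        workflows.argoproj.io/workflow: "{{workflow.name}}"
```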
@ajkaanbal This is for pulling the logs from the pods created by the spark CRD (spark-driver, spark-executor)
In the Argo UI I see these logs:
I want to be able to archive those logs.
@Freia3 The current resource template does not support archiving logs. Would you like to work on this enhancement?
@sarabala1979 Ok, thanks for the information, couldn't find this in the docs. No, I can't work on this enhancement.
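Until the resource template supports this, one possible workaround is a follow-up task that copies the driver log into an output artifact. The sketch below is not from this thread and makes several assumptions: the `collect-driver-log` template, the `app-name` parameter, and the `bitnami/kubectl` image are hypothetical choices; the driver pod is assumed to carry the `spark-role=driver` and `sparkoperator.k8s.io/app-name` labels; and the workflow's service account is assumed to be allowed to read pod logs in the `freia` namespace.

```yaml
# Sketch of a workaround, under the assumptions stated above.
templates:
- name: sparkpi
  resource:
    action: create
    # ... conditions and manifest unchanged from the workflow in the issue ...
  outputs:
    parameters:
    - name: app-name              # hypothetical name; exposes the generated SparkApplication name
      valueFrom:
        jsonPath: '{.metadata.name}'
- name: collect-driver-log        # hypothetical helper template
  inputs:
    parameters:
    - name: app-name
  container:
    image: bitnami/kubectl:latest # assumption: any image that provides sh and kubectl
    command: [sh, -c]
    args:
    - >-
      kubectl logs -n freia --tail=-1
      -l spark-role=driver,sparkoperator.k8s.io/app-name={{inputs.parameters.app-name}}
      > /tmp/driver.log
  outputs:
    artifacts:
    - name: driver-log            # stored in the configured artifact repository
      path: /tmp/driver.log
- name: sparkling-operator
  dag:
    tasks:
    - name: SparkPi1
      template: sparkpi
    - name: CollectLog
      dependencies: [SparkPi1]
      template: collect-driver-log
      arguments:
        parameters:
        - name: app-name
          value: "{{tasks.SparkPi1.outputs.parameters.app-name}}"
```

The `--tail=-1` flag matters here because `kubectl logs` with a label selector otherwise returns only the last few lines per pod.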
@sarabala1979 Hello, I wish to contribute on this one. It is relevant for my team and I believe that the fix is straightforward.
From what I've seen there are two ways to solve it:
1. Make the argoexec resource command store the logs as an artifact using executor.WorkflowExecutor.SaveLogs
   - I still need to check if it will be able to store its own container logs while still running
2. Initiate a wait container for resource pods, which by design stores the logs of the main container here
   - I worry about redundant error reporting but I think it is safe
tbh I think 2 is the better option, wdyt?
I'm probably in agreement about 2 being the right way to do it. @sarabala1979, can you pitch in?
Hey @sarabala1979, do you want me to create a pull request?