[backend] Panic while connection to default cache endpoint ml-pipeline.kubeflow:8887
Environment
- How did you deploy Kubeflow Pipelines (KFP)? Manifests
- KFP version: 2.0.0
- KFP SDK version:
kfp 2.0.1
kfp-pipeline-spec 0.2.2
kfp-server-api 2.0.0
Steps to reproduce
Hello, we are trying to migrate from pipelines 1.8.5 to 2.0.0, but after applying the manifests we are having some issues.
Running the "hello world" example from JupyterLab:
import kfp
from kfp import dsl

@dsl.component
def say_hello(name: str) -> str:
    hello_text = f'Hello, {name}!'
    print(hello_text)
    return hello_text

@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    return hello_task.output

from kfp import compiler
compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')

from kfp.client import Client
client = Client()
run = client.create_run_from_pipeline_package(
    'pipeline.yaml',
    arguments={
        'recipient': 'World',
    },
)
Or, running the generated pipeline.yaml directly through the UI, we always get the following error on the third pod that is started:
time="2023-07-05T14:19:23.912Z" level=info msg="capturing logs" argo=true
time="2023-07-05T14:19:23.945Z" level=info msg="capturing logs" argo=true
I0705 14:19:23.966873 51 launcher_v2.go:90] input ComponentSpec:{
"inputDefinitions": {
"parameters": {
"name": {
"parameterType": "STRING"
}
}
},
"outputDefinitions": {
"parameters": {
"Output": {
"parameterType": "STRING"
}
}
},
"executorLabel": "exec-say-hello"
}
I0705 14:19:23.967498 51 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I0705 14:19:23.967512 51 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]
goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x1d3c167?, {0x20a4878?, 0xc000058040?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0006a63a0?}, 0xc73bb0?)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc00028e540, {0x20a4878, 0xc000058040})
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2023-07-05T14:19:24.950Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2023-07-05T14:19:25.918Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
The service ml-pipeline.kubeflow:8887 exists.
Everything works great on version 1.8.5.
If you need the logs from the other two pods, please let me know. I also checked the logs in all the Kubeflow services and can't find any issue.
/assign @Linchin
Hi @andre-lx, thank you for bringing up this issue. I tried the same pipeline on a newly deployed 2.0.0 cluster, and the run finished without issue. Looking at the log you provided, we have
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000b29920, {0x20a4878, 0xc000058040}, 0x0, 0x0, {0x0, 0x0, 0xc000b60000?}, 0x4) /go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
The metadata client seems to come from version 2.0.0-rc.2 instead of version 2.0.0. Could you double-check that you applied the manifest of version 2.0.0? Try applying the manifest again (here) and see if the issue persists.
Also, could you let me know how you deployed KFP, standalone or via Kubeflow?
Hi @Linchin, I just checked and we are using the following image: https://github.com/kubeflow/pipelines/blob/e03e31219387b587b700ba3e31a02df486aa364f/manifests/kustomize/base/metadata/base/kustomization.yaml#L10-L12
The deployment was done using the following file: https://github.com/kubeflow/pipelines/blob/2.0.0/manifests/kustomize/env/platform-agnostic-multi-user/kustomization.yaml
Thanks
Hi @andre-lx @Linchin Same issue we are also facing. Did you get a chance to fix it?
I had to revert it to 1.8.5 for now.
- @chensun for visibility of this issue
I have the same error. Here are the details.
- Running in standalone mode
- Running in virtual cluster (everything is working but cannot run pipelines)
- All pods are working
- I can upload and run pipelines on UI, but the pod is failing
- Using the pipelines version 2.0.0
- Generating the pipeline with the command below:
  kfp dsl compile --py v2/hello_world.py --output hello_world.pipeline.json
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I also have this issue in my Kubeflow 1.8 environment. Kubeflow 1.8 uses pipelines backend 2.0.3.
I deployed my environment using the Kubeflow 1.8 manifests.
Can someone fix this issue?
The same issue on Kubeflow 1.8.
I have faced a similar issue. I have a full Kubeflow 1.8 environment installed, and the pipeline backend metadata envoy is version 2.0.3. Is this issue resolved?
I've faced a similar issue, and it was due to a proxy setting on the pod/step. After removing the proxy setting, the issue was gone.
@umka1332 This solved the problem for me also. But do you know a way how I can still set proxy env vars to connect to the internet?
Just tested successfully that setting NO_PROXY to '*.kubeflow,*.local' seems to work together with http(s)_proxy. It makes sense that the connection to ml-pipeline fails without NO_PROXY because then all traffic will be routed through the given proxy. It is just strange that it has seemed to work before updating kubeflow.
If anyone following this can reliably reproduce this issue...
we always get the following error on the third pod that is started
I also need to see the log from the second pod (the driver) that is started. Thanks.
How did you solve this? I tried to set the no_proxy environment variables but it did not work for me. @umka1332
The important thing is to set NO_PROXY (so all uppercase). I also had to add the kube-apiserver IP to NO_PROXY.
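The suggestion above can be sketched as a small helper (the function name and defaults are my own, not part of the KFP SDK); the resulting string is what you would pass as the NO_PROXY value on the task:

```python
import os

def build_no_proxy(extra_hosts=(".kubeflow", ".local")):
    """Build a NO_PROXY value that keeps in-cluster traffic off the proxy.

    Prepends the kube-apiserver address (exposed inside pods via the
    KUBERNETES_SERVICE_HOST env var) to the cluster-internal suffixes.
    """
    hosts = list(extra_hosts)
    api_server = os.environ.get("KUBERNETES_SERVICE_HOST")
    if api_server:
        hosts.insert(0, api_server)
    return ",".join(hosts)
```

For example, in a pod where KUBERNETES_SERVICE_HOST is 10.96.0.1, this yields "10.96.0.1,.kubeflow,.local", which could be set on a task with something like hello_task.set_env_variable(name='NO_PROXY', value=build_no_proxy()).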
Kubeflow 1.8.1 has the same problem...
I solved this problem by deleting the proxy settings. You must delete the proxy; if you need packages, build an image that already contains them.
from kfp import dsl
from kfp import compiler

@dsl.component()
def say_hello():
    import time
    time.sleep(1900)
    hello_text = 'Hello!'
    print(hello_text)

@dsl.pipeline
def hello_pipeline():
    hello_task = say_hello()
    hello_task.set_env_variable(name='NO_PROXY', value='*.kubeflow,*.local')
    hello_task.set_env_variable(name='no_proxy', value='*.kubeflow,*.local')
    hello_task.set_caching_options(False)

compiler.Compiler().compile(hello_pipeline, package_path='pipeline.yaml')
I tried running this but it did not work for me. Is there something I am missing here? @pschoen-itsc @umka1332
@suanshs It seems like you are having a different problem. If you don't have any proxies set to begin with, then you should not need the NO_PROXY settings either. Can you provide the logs of all the containers of the failing pod?
@pschoen-itsc Following are the logs from the main container of the failing pod:
time="2024-08-28T14:19:16.866Z" level=info msg="capturing logs" argo=true
time="2024-08-28T14:19:16.900Z" level=info msg="capturing logs" argo=true
I0828 14:19:16.922099 53 launcher_v2.go:90] input ComponentSpec:{
"executorLabel": "exec-say-hello"
}
I0828 14:19:16.922671 53 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x941c29]
goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc000afc720, {0x20a4878, 0xc000196000}, 0x0, 0x0, {0x0, 0x0, 0xc0004dc000?}, 0x4)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:388 +0x69
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0x467387?, {0x20a4878?, 0xc000196000?}, 0x1?, 0x1?, {0x0?, 0x1a51660?, 0xc0004c6060?}, 0xbbfbb0?)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:266 +0x9b
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x65
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc000306460, {0x20a4878, 0xc000196000})
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x91e
main.run()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ed
main.main()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x19
time="2024-08-28T14:19:17.903Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
time="2024-08-28T14:19:18.871Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
Following are the logs from the wait container:
time="2024-08-28T14:19:16.138Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.10
time="2024-08-28T14:19:16.141Z" level=info msg="Creating a emissary executor"
time="2024-08-28T14:19:16.141Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-08-28T14:19:16.141Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kubeflow podName=hello-pipeline-2clrb-1334336905 template="{\"name\":\"system-container-impl\",\"inputs\":{\"parameters\":[{\"name\":\"pod-spec-patch\",\"value\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main --component_module_path \\\\\\\"$program_path/ephemeral_component.py\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n import time\\\\n time.sleep(1900)\\\\n hello_text = f'Hello, Suansh!'\\\\n print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"}},\"container\":{\"name\":\"\",\"image\":\"gcr.io/ml-pipeline/should-be-overridden-during-runtime\",\"command\":[\"should-be-overridden-during-runtime\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}}],\"resources\":{},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\",\"emptyDir\":{}}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/ml-pipeline/kfp-launcher@sha256:80cf120abd125db84fa547640fd6386c4b2a26936e0c2b04a7d3634991a85
0a4\",\"command\":[\"launcher-v2\",\"--copy\",\"/kfp-launcher/launch\"],\"resources\":{\"limits\":{\"cpu\":\"500m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"100m\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905\"}},\"podSpecPatch\":\"{\\\"containers\\\":[{\\\"name\\\":\\\"main\\\",\\\"image\\\":\\\"docker-dev-artifactory.workday.com/ml/kubeflow/python-3.7:latest\\\",\\\"command\\\":[\\\"/var/run/argo/argoexec\\\",\\\"emissary\\\",\\\"--\\\",\\\"/kfp-launcher/launch\\\",\\\"--pipeline_name\\\",\\\"hello-pipeline\\\",\\\"--run_id\\\",\\\"5610709d-50b9-4833-8e2d-7e72a19a97ec\\\",\\\"--execution_id\\\",\\\"91\\\",\\\"--executor_input\\\",\\\"{\\\\\\\"inputs\\\\\\\":{},\\\\\\\"outputs\\\\\\\":{\\\\\\\"outputFile\\\\\\\":\\\\\\\"/tmp/kfp_outputs/output_metadata.json\\\\\\\"}}\\\",\\\"--component_spec\\\",\\\"{\\\\\\\"executorLabel\\\\\\\":\\\\\\\"exec-say-hello\\\\\\\"}\\\",\\\"--pod_name\\\",\\\"$(KFP_POD_NAME)\\\",\\\"--pod_uid\\\",\\\"$(KFP_POD_UID)\\\",\\\"--mlmd_server_address\\\",\\\"$(METADATA_GRPC_SERVICE_HOST)\\\",\\\"--mlmd_server_port\\\",\\\"tcp://10.100.242.77:8080\\\",\\\"--\\\"],\\\"args\\\":[\\\"sh\\\",\\\"-c\\\",\\\"\\\\nif ! 
[ -x \\\\\\\"$(command -v pip)\\\\\\\" ]; then\\\\n python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\\\\nfi\\\\n\\\\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.0.1' \\\\u0026\\\\u0026 \\\\\\\"$0\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"sh\\\",\\\"-ec\\\",\\\"program_path=$(mktemp -d)\\\\nprintf \\\\\\\"%s\\\\\\\" \\\\\\\"$0\\\\\\\" \\\\u003e \\\\\\\"$program_path/ephemeral_component.py\\\\\\\"\\\\npython3 -m kfp.components.executor_main --component_module_path \\\\\\\"$program_path/ephemeral_component.py\\\\\\\" \\\\\\\"$@\\\\\\\"\\\\n\\\",\\\"\\\\nimport kfp\\\\nfrom kfp import dsl\\\\nfrom kfp.dsl import *\\\\nfrom typing import *\\\\n\\\\ndef say_hello() :\\\\n import time\\\\n time.sleep(1900)\\\\n hello_text = f'Hello, Suansh!'\\\\n print(hello_text)\\\\n\\\\n\\\",\\\"--executor_input\\\",\\\"{{$}}\\\",\\\"--function_to_execute\\\",\\\"say_hello\\\"],\\\"env\\\":[{\\\"name\\\":\\\"NO_PROXY\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"},{\\\"name\\\":\\\"no_proxy\\\",\\\"value\\\":\\\"172.17.68.189,.kubeflow,.local\\\"}],\\\"resources\\\":{}}]}\"}" version="&Version{Version:v3.3.10,BuildDate:2022-11-29T18:18:30Z,GitCommit:b19870d737a14b21d86f6267642a63dd14e5acd5,GitTag:v3.3.10,GitTreeState:clean,GoVersion:go1.17.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-08-28T14:19:16.141Z" level=info msg="Starting deadline monitor"
time="2024-08-28T14:19:18.142Z" level=info msg="Main container completed"
time="2024-08-28T14:19:18.142Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving logs"
time="2024-08-28T14:19:18.142Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log"
time="2024-08-28T14:19:18.142Z" level=info msg="Creating minio client using static credentials" endpoint="minio.kubeflow:9000"
time="2024-08-28T14:19:18.142Z" level=info msg="Saving file to s3" bucket=mlpipeline endpoint="minio.kubeflow:9000" key=artifacts/kubeflow/hello-pipeline-2clrb/2024-08-28/hello-pipeline-2clrb-1334336905/main.log path=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2024-08-28T14:19:18.151Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2024-08-28T14:19:18.151Z" level=info msg="No output parameters"
time="2024-08-28T14:19:18.151Z" level=info msg="No output artifacts"
time="2024-08-28T14:19:18.168Z" level=info msg="Create workflowtaskresults 201"
time="2024-08-28T14:19:18.169Z" level=info msg="Killing sidecars []"
time="2024-08-28T14:19:18.169Z" level=info msg="Alloc=6749 TotalAlloc=12722 Sys=24786 NumGC=4 Goroutines=9"
Following are the logs from
@suanshs Do you also have logs from the Istio sidecar, or do you have no Istio deployed?
Thanks! This helped me a lot!