Failed to launch Katib experiment - 404 page not found

Open anneum opened this issue 4 years ago • 15 comments

/kind bug

What steps did you take and what happened: Set up Kubeflow and installed Katib manually as described in https://github.com/kubeflow/katib/issues/1415. Started a Katib experiment with Kale from a Jupyter notebook. The experiment was created and the pipeline was uploaded, but it was not launched.

Type: RPC

Method: katib.create_katib_experiment()

Code: 6 (UnhandledError)

Transaction ID: ylpewg72bh

Message: Failed to launch Katib experiment

Details: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 08 Mar 2021 12:14:43 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found

kale.log:

2021-03-08 12:14:42 run:83 [[DEBUG]] [TID=axkqxleth9] [] Decoding ctx of RPC function 'kfp.create_experiment'
2021-03-08 12:14:42 run:95 [[DEBUG]] [TID=axkqxleth9] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Decoding kwargs of RPC function 'kfp.create_experiment'
2021-03-08 12:14:42 run:104 [[DEBUG]] [TID=axkqxleth9] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Importing RPC function 'kfp.create_experiment'
2021-03-08 12:14:42 run:114 [[INFO]] [TID=axkqxleth9] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Executing RPC function 'create_experiment(experiment_name=test-v1ef9)'
2021-03-08 12:14:43 _client:352 [[INFO]] Creating experiment test-v1ef9.
2021-03-08 12:14:43 run:83 [[DEBUG]] [TID=ylpewg72bh] [] Decoding ctx of RPC function 'katib.create_katib_experiment'
2021-03-08 12:14:43 run:95 [[DEBUG]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Decoding kwargs of RPC function 'katib.create_katib_experiment'
2021-03-08 12:14:43 run:104 [[DEBUG]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Importing RPC function 'katib.create_katib_experiment'
2021-03-08 12:14:43 run:114 [[INFO]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Executing RPC function 'create_katib_experiment(pipeline_id=832dfc28-61be-4fb5-af12-7877778b26ef, pipeline_metadata={'autosnapshot': True, 'docker_image': 'jupyter-kale:latest', 'experiment': {'id': '7f611f1b-bf8e-4709-80ef-c55d6644931c', 'name': 'test'}, 'experiment_name': 'test-v1ef9', 'katib_metadata': {'parameters': [{'feasibleSpace': {'max': '2000', 'min': '100', 'step': '100'}, 'name': 'N_ESTIMATORS', 'parameterType': 'int'}, {'feasibleSpace': {'list': ['10', '20', '30', '40', '50', '100']}, 'name': 'MAX_DEPTH', 'parameterType': 'categorical'}, {'feasibleSpace': {'max': '4', 'min': '1', 'step': '1'}, 'name': 'MIN_SAMPLES_LEAF', 'parameterType': 'int'}, {'feasibleSpace': {'list': ['2', '5', '10']}, 'name': 'MIN_SAMPLES_SPLIT', 'parameterType': 'categorical'}], 'objective': {'additionalMetricNames': [], 'goal': 0.85, 'objectiveMetricName': 'random-forest-accuracy', 'type': 'maximize'}, 'algorithm': {'algorithmName': 'random', 'algorithmSettings': [{'name': 'random_state', 'value': '10'}, {'name': 'acq_optimizer', 'value': 'auto'}, {'name': 'acq_func', 'value': 'gp_hedge'}, {'name': 'base_estimator', 'value': 'GP'}]}, 'maxTrialCount': 10, 'maxFailedTrialCount': 3, 'parallelTrialCount': 5}, 'katib_run': True, 'pipeline_description': 'Fine tune a RF classifier on the Titanic dataset', 'pipeline_name': 'titanic-hp-tuning', 'snapshot_volumes': True, 'steps_defaults': [], 'volumes': []}, output_path=/home/jovyan/medium/minikf)'
2021-03-08 12:14:43 katib:181 [[INFO]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Saving Katib experiment definition at /home/jovyan/medium/minikf/test-v1ef9.katib.yaml
2021-03-08 12:14:43 katib:91 [[DEBUG]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Launching Katib Experiment 'test-v1ef9'...
2021-03-08 12:14:43 katib:97 [[ERROR]] [TID=ylpewg72bh] [/home/jovyan/medium/minikf/titanic-katib.ipynb] Failed to launch Katib experiment
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/kale/rpc/katib.py", line 95, in _launch_katib_experiment
    katib_experiment)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 265, in POST
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 221, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 08 Mar 2021 12:14:43 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found

What did you expect to happen: The katib experiment is launched.

Anything else you would like to add: Can I figure out which DNS address is being requested?
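
On the question of which address is requested: the traceback shows the call going through `create_namespaced_custom_object`, which POSTs to `/apis/{group}/{version}/namespaces/{namespace}/{plural}` on the API server. Inside a pod, the Kubernetes Python client's in-cluster configuration takes the host from the `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` environment variables. Below is a minimal sketch of how that URL is put together, assuming group `kubeflow.org` and plural `experiments` for Katib. The helper name `custom_object_url`, the version and the namespace are placeholders for illustration and are not taken from the Kale source.

```python
import os


def custom_object_url(group, version, namespace, plural, env=None):
    """Build the URL that a namespaced custom-object create call POSTs to."""
    env = os.environ if env is None else env
    # In-cluster config derives the API server address from these variables.
    host = env.get("KUBERNETES_SERVICE_HOST", "<unknown-host>")
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    return "https://{}:{}/apis/{}/{}/namespaces/{}/{}".format(
        host, port, group, version, namespace, plural)


if __name__ == "__main__":
    # "v1beta1" and "my-namespace" are placeholder values, not from the logs.
    print(custom_object_url("kubeflow.org", "v1beta1",
                            "my-namespace", "experiments"))
```

Running this in the notebook terminal prints the address built from the pod's own environment; the version segment actually used by Kale would have to be read from the generated `.katib.yaml` file.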

Environment:

  • Kubeflow version: kfctl v1.2.0-0-gbc038f9
  • OnPremise Kubernetes Cluster
  • Kubernetes version: v1.17.0
  • OS: Ubuntu 18.04.5 LTS

anneum avatar Mar 08 '21 12:03 anneum

@anneum Once you create Katib SDK client you can pass the kubeconfig path: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L33-L43.

andreyvelich avatar Mar 09 '21 11:03 andreyvelich

@andreyvelich thank you for your response. I don't fully understand what you mean by that. The notebook server was created from the Kubeflow Notebook Servers section and is therefore already inside the cluster.

I click "compile and run katib job" and there is no option where I can pass something like the kubeconfig path.

anneum avatar Mar 09 '21 12:03 anneum

Got it. This issue might refer to Kale itself. /cc @StefanoFioravanzo @yanniszark

andreyvelich avatar Mar 09 '21 13:03 andreyvelich

@andreyvelich thanks for the ping. Kale currently doesn't support providing a custom kubeconfig. Can you make sure the notebook Pod has a proper kubeconfig and that you can query for experiments with kubectl?

StefanoFioravanzo avatar Mar 09 '21 18:03 StefanoFioravanzo

Thanks @StefanoFioravanzo. @anneum Please try to execute kubectl command from your notebook.

andreyvelich avatar Mar 09 '21 20:03 andreyvelich

@StefanoFioravanzo I am very surprised about this but yes, I see the pods in my namespace.

tf-docker ~ > kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
jupyter-kale-0                                     2/2     Running   4          15d
ml-pipeline-ui-artifact-8669b444d8-mq4wd           2/2     Running   4          15d
ml-pipeline-visualizationserver-744ffd6cdf-x57tb   2/2     Running   2          15d
test-0                                             2/2     Running   2          4d21h
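
Listing pods only exercises the core API, so it does not show whether the API server serves the `kubeflow.org` group at the version the experiment definition asks for. A minimal sketch of that check, assuming the JSON comes from `kubectl get --raw /apis` (an APIGroupList). The sample document below is made up for illustration and `served_versions` is a hypothetical helper.

```python
import json


def served_versions(api_group_list, group):
    """Return the versions the API server advertises for one API group."""
    for entry in api_group_list.get("groups", []):
        if entry.get("name") == group:
            return [v["version"] for v in entry.get("versions", [])]
    return []


# Hypothetical, trimmed output of `kubectl get --raw /apis`.
SAMPLE = json.loads("""
{"kind": "APIGroupList", "groups": [
  {"name": "kubeflow.org", "versions": [
    {"groupVersion": "kubeflow.org/v1beta1", "version": "v1beta1"}]}
]}
""")

if __name__ == "__main__":
    versions = served_versions(SAMPLE, "kubeflow.org")
    print("served:", versions)
    print("v1alpha3 served:", "v1alpha3" in versions)
```

A version being listed for the group does not guarantee that `experiments` exists at that version, since several CRDs share the `kubeflow.org` group; `kubectl get --raw /apis/kubeflow.org/<version>` lists the resources for one version.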

I asked which URL is called because we are behind a company proxy, and I would like to rule that out as the cause.

anneum avatar Mar 10 '21 08:03 anneum

@anneum Did you also try to run KFP pipelines? I'd like to understand if this issue is confined to creating Katib experiments or if it is an issue on the Kale side not being able to contact the K8s API Server.

StefanoFioravanzo avatar Mar 11 '21 15:03 StefanoFioravanzo

@StefanoFioravanzo I can create and run standard pipelines (without a katib job) out of my notebook server.

tf-docker ~ > kfp pipeline list
+--------------------------------------+-------------------------------------------------+---------------------------+
| Pipeline ID                          | Name                                            | Uploaded at               |
+======================================+=================================================+===========================+
| c5fda645-e075-4f09-964f-66593d1ce87e | pipeline-p147r                                  | 2021-03-03T13:13:43+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+

anneum avatar Mar 11 '21 15:03 anneum

Same problem here: I get a 404 error when I try to create a katib_experiment with the SDK. Did you fix your problem?

Ulov888 avatar May 19 '21 07:05 Ulov888

@Ulov888 How did you install Katib?

johnugeorge avatar May 19 '21 17:05 johnugeorge

I am getting a 404 error when I try to create katib_experiment by SDK using the Notebook servers. I'm trying to implement this example.

Versions:

  • Python: 3.6.8
  • Kubeflow: kfctl_k8s_istio.v1.0.1
  • kubeflow-katib: 0.10.1
  • kubernetes: 10.0.1

Siddarth-Pattnaik avatar Aug 04 '21 07:08 Siddarth-Pattnaik

@Siddarth-Pattnaik I think you should update your Kubeflow version to at least 1.1 to use the Katib SDK. Kubeflow 1.0.1 shipped Katib v1alpha3, which the SDK doesn't support.

andreyvelich avatar Aug 04 '21 16:08 andreyvelich

@StefanoFioravanzo as with @anneum, I can open a terminal from within the notebook that is capable of running both kubectl get pods as well as kfp pipeline list without issue.

The issue remains as follows:

2021-09-17 03:14:28 run:120 [ERROR] [TID=jhypgoxrcf] [/home/jovyan/Untitled.ipynb] RPC function 'create_katib_experiment' raised an RPCError
Traceback (most recent call last):
  File "/kale/backend/kale/rpc/katib.py", line 107, in _launch_katib_experiment
    katib_experiment)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api/custom_objects_api.py", line 183, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api/custom_objects_api.py", line 289, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 345, in call_api
    _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 176, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 388, in request
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 278, in POST
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 17 Sep 2021 03:14:28 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found

When I attempt to create a job, the pipeline and the KFP experiment are both created without issue (and both access the k8s API). What breaks is the Katib experiment itself.
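
The response body may help narrow this down. When the API server recognises the requested resource type, it normally answers errors with a JSON `Status` object. A bare `404 page not found` in `text/plain`, as in the tracebacks above, typically means the request path itself (group, version or plural) matched no served endpoint. A hedged sketch of that distinction follows; `classify_404` is a hypothetical helper, not part of Kale or the Kubernetes client, and its labels are only a guess at the cause.

```python
import json


def classify_404(content_type, body):
    """Guess what kind of 404 a Kubernetes API response represents."""
    if content_type.startswith("application/json"):
        try:
            status = json.loads(body)
        except ValueError:
            return "unparseable JSON body"
        if status.get("kind") == "Status":
            # The resource type was routed; the server reports details.
            return "api status: {}".format(status.get("message", ""))
    if body.strip() == "404 page not found":
        # Generic not-found page: the path matched no served endpoint.
        return "path not served"
    return "unknown"


if __name__ == "__main__":
    # Header and body values copied from the traceback in this thread.
    print(classify_404("text/plain; charset=utf-8", "404 page not found\n"))
```

If the result is "path not served", comparing the `apiVersion` in the saved `.katib.yaml` with the versions installed in the cluster would be a reasonable next step.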

This issue has persisted across multiple kale notebook versions, my KF version is 1.3

@andreyvelich I can use the SDK normally (following this guide) within the notebook terminal.

Please advise?

minaadel avatar Sep 17 '21 03:09 minaadel

I see the same issue on Kubeflow 1.3. Was anyone able to fix this?

2021-12-03 23:15:24 run:120 [ERROR] [TID=kl824ee10y] [/home/jovyan/kale/examples/dog-breed-classification/dog-breed-katib.ipynb] RPC function 'create_katib_experiment' raised an RPCError
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/kale/rpc/katib.py", line 104, in _launch_katib_experiment
    katib_experiment)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'ab77065d-ed04-4fa5-bd00-a66298a0e074', 'X-Kubernetes-Pf-Prioritylevel-Uid': '55506d34-196f-4b07-b92d-80a43a2898e6', 'Date': 'Fri, 03 Dec 2021 23:15:24 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found

pshegde avatar Dec 07 '21 06:12 pshegde

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 06:04 stale[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 13 '23 10:09 github-actions[bot]