GPU not being used by Katib experiment on GKE - Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
/kind bug
What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a TensorFlow text classification model using Katib on a GKE cluster. I created the cluster using the commands below:
CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-2"
SCOPES="cloud-platform"
NODES_NUM=1
gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM
gcloud config set compute/zone $ZONE
gcloud container clusters get-credentials $CLUSTER_NAME
export PIPELINE_VERSION=1.8.2
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
# katib
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
kubectl apply -f ./test.yaml
# disabling caching
export NAMESPACE=kubeflow
kubectl get mutatingwebhookconfiguration cache-webhook-${NAMESPACE}
kubectl patch mutatingwebhookconfiguration cache-webhook-${NAMESPACE} --type='json' -p='[{"op":"replace", "path": "/webhooks/0/rules/0/operations/0", "value": "DELETE"}]'
kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
GPU_POOL_NAME="gpu-pool2"
CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
CLUSTER_ZONE="us-central1-a"
GPU_TYPE="nvidia-tesla-k80"
GPU_COUNT=1
MACHINE_TYPE="n1-highmem-8"
NODES_NUM=1
# Node pool creation may take several minutes.
gcloud container node-pools create ${GPU_POOL_NAME} --accelerator type=${GPU_TYPE},count=${GPU_COUNT} --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} --num-nodes=0 --machine-type=${MACHINE_TYPE} --scopes=cloud-platform --num-nodes $NODES_NUM
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
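To verify that the driver installer finished and the GPU actually became allocatable before running the pipeline, a quick check (not part of the original steps, shown here only as a sketch) is:
# confirm the driver installer and device plugin pods are running
kubectl get pods -n kube-system | grep nvidia
# confirm the node in the GPU pool advertises nvidia.com/gpu as allocatable
kubectl describe node -l cloud.google.com/gke-nodepool=${GPU_POOL_NAME} | grep nvidia.com/gpu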
I then created a kubeflow pipeline:
from kfp import compiler
import kfp
import kfp.dsl as dsl
from kfp import components
@dsl.pipeline(
name="End to End Pipeline",
description="An end to end mnist example including hyperparameter tuning, train and inference"
)
def pipeline_func(
time_loc = "gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
hyper_image_uri_train = "gcr.io/.............../hptunekatib:v7",
hyper_image_uri = "gcr.io/.............../hptunekatibclient:v7",
model_uri = "gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
experiment_name = "dbpedia-exp-1",
experiment_namespace = "kubeflow",
experiment_timeout_minutes = 60
):
# Katib hyperparameter tuning stage: launches the Experiment via the client image and returns the best parameters
hp_tune = dsl.ContainerOp(
name='hp-tune-katib',
image=hyper_image_uri,
arguments=[
'--experiment_name', experiment_name,
'--experiment_namespace', experiment_namespace,
'--experiment_timeout_minutes', experiment_timeout_minutes,
'--delete_after_done', True,
'--hyper_image_uri', hyper_image_uri_train,
'--time_loc', time_loc,
'--model_uri', model_uri
],
file_outputs={'best-params': '/output.txt'}
).set_gpu_limit(1)
# restricting the maximum usable memory and CPU for this stage
hp_tune.set_memory_limit("49G")
hp_tune.set_cpu_limit("7")
# Run the Kubeflow Pipeline in the user's namespace.
if __name__ == '__main__':
# compiling the model and generating tar.gz file to upload to Kubeflow Pipeline UI
import kfp.compiler as compiler
compiler.Compiler().compile(
pipeline_func, 'pipeline_db.tar.gz'
)
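As an aside, the compiled package can also be submitted without the UI via the KFP SDK client; a minimal sketch, assuming a reachable KFP endpoint (the host URL below is a placeholder, not taken from this issue):
import kfp
client = kfp.Client(host="https://<your-kfp-endpoint>")  # placeholder endpoint
client.create_run_from_pipeline_package(
    pipeline_file="pipeline_db.tar.gz",
    arguments={},  # parameters fall back to the defaults declared in pipeline_func
    run_name="dbpedia-katib-hp-tune",
)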
These are my two containers:
- One to launch the Katib Experiment based on the specified parameters and arguments passed to dsl.ContainerOp()
- One with the main training script for text classification. This container is passed as the "image" in the Katib trial spec
gcr.io/.............../hptunekatibclient:v7
# importing required packages
import argparse
import datetime
from datetime import datetime as dt
from distutils.util import strtobool
import json
import os
import logging
import time
import pandas as pd
from google.cloud import storage
from pytz import timezone
from kubernetes.client import V1ObjectMeta
from kubeflow.katib import KatibClient
from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1Experiment
from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1MetricsCollectorSpec
from kubeflow.katib import V1beta1CollectorSpec
from kubeflow.katib import V1beta1FileSystemPath
from kubeflow.katib import V1beta1SourceSpec
from kubeflow.katib import V1beta1FilterSpec
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)
FINISH_CONDITIONS = ["Succeeded", "Failed"]
# function to record the start time and end time to calculate execution time, pipeline start up and teardown time
def write_time(types, time_loc):
formats = "%Y-%m-%d %I:%M:%S %p"
now_utc = dt.now(timezone('UTC'))
now_asia = now_utc.astimezone(timezone('Asia/Kolkata'))
start_time = str(now_asia.strftime(formats))
time_df = pd.DataFrame({"time":[start_time]})
print("written")
time_df.to_csv(time_loc + types + ".csv", index=False)
def get_args():
parser = argparse.ArgumentParser(description='Katib Experiment launcher')
parser.add_argument('--experiment_name', type=str,
help='Experiment name')
parser.add_argument('--experiment_namespace', type=str, default='anonymous',
help='Experiment namespace')
parser.add_argument('--experiment_timeout_minutes', type=int, default=60*24,
help='Time in minutes to wait for the Experiment to complete')
parser.add_argument('--delete_after_done', type=strtobool, default=True,
help='Whether to delete the Experiment after it is finished')
parser.add_argument('--hyper_image_uri', type=str, default="gcr.io/.............../hptunekatib:v2",
help='Hyper image uri')
parser.add_argument('--time_loc', type=str, default="gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
help='Time loc')
parser.add_argument('--model_uri', type=str, default="gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
help='Model URI')
return parser.parse_args()
def wait_experiment_finish(katib_client, experiment, timeout):
polling_interval = datetime.timedelta(seconds=30)
end_time = datetime.datetime.now() + datetime.timedelta(minutes=timeout)
experiment_name = experiment.metadata.name
experiment_namespace = experiment.metadata.namespace
while True:
current_status = None
try:
current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
except Exception as e:
logger.info("Unable to get current status for the Experiment: {} in namespace: {}. Exception: {}".format(
experiment_name, experiment_namespace, e))
# If Experiment has reached complete condition, exit the loop.
if current_status in FINISH_CONDITIONS:
logger.info("Experiment: {} in namespace: {} has reached the end condition: {}".format(
experiment_name, experiment_namespace, current_status))
return
# Print the current condition.
logger.info("Current condition for Experiment: {} in namespace: {} is: {}".format(
experiment_name, experiment_namespace, current_status))
# If the timeout has been reached, raise an exception.
if datetime.datetime.now() > end_time:
raise Exception("Timeout waiting for Experiment: {} in namespace: {} "
"to reach one of these conditions: {}".format(
experiment_name, experiment_namespace, FINISH_CONDITIONS))
# Sleep for poll interval.
time.sleep(polling_interval.seconds)
if __name__ == "__main__":
args = get_args()
write_time("hyper_parameter_tuning_start", args.time_loc)
# Trial count specification.
max_trial_count = 2
max_failed_trial_count = 2
parallel_trial_count = 1
# Objective specification.
objective = V1beta1ObjectiveSpec(
type="minimize",
# goal=100,
objective_metric_name="accuracy"
# additional_metric_names=["accuracy"]
)
# Metrics collector specification (commented out).
# metrics_collector_specs = V1beta1MetricsCollectorSpec(
# collector=V1beta1CollectorSpec(kind="File"),
# source=V1beta1SourceSpec(
# file_system_path=V1beta1FileSystemPath(
# # format="TEXT",
# path="/opt/trainer/katib/metrics.log",
# kind="File"
# ),
# filter=V1beta1FilterSpec(
# # metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
# metrics_format=["([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"]
# )
# )
# )
# Algorithm specification.
algorithm = V1beta1AlgorithmSpec(
algorithm_name="random",
)
# Experiment search space.
# In this example we tune learning rate and batch size.
parameters = [
V1beta1ParameterSpec(
name="batch_size",
parameter_type="discrete",
feasible_space=V1beta1FeasibleSpace(
list=["32", "42", "52", "62", "64"]
),
),
V1beta1ParameterSpec(
name="learning_rate",
parameter_type="double",
feasible_space=V1beta1FeasibleSpace(
min="0.001",
max="0.005"
),
)
]
# TODO (andreyvelich): Use community image for the mnist example.
trial_spec = {
"apiVersion": "kubeflow.org/v1",
"kind": "TFJob",
"spec": {
"tfReplicaSpecs": {
"PS": {
"replicas": 1,
"restartPolicy": "Never",
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false",
}
},
"spec": {
"containers": [
{
"name": "tensorflow",
"image": args.hyper_image_uri,
"command": [
"python",
"/opt/trainer/task.py",
"--model_uri=" + args.model_uri,
"--batch_size=${trialParameters.batchSize}",
"--learning_rate=${trialParameters.learningRate}"
],
"ports" : [
{
"containerPort": 2222,
"name" : "tfjob-port"
}
]
# "resources": {
# "limits" : {
# "cpu": "1"
# }
# }
}
]
}
}
},
"Worker": {
"replicas": 1,
"restartPolicy": "Never",
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false",
}
},
"spec": {
"containers": [
{
"name": "tensorflow",
"image": args.hyper_image_uri,
"command": [
"python",
"/opt/trainer/task.py",
"--model_uri=" + args.model_uri,
"--batch_size=${trialParameters.batchSize}",
"--learning_rate=${trialParameters.learningRate}"
],
"ports" : [
{
"containerPort": 2222,
"name" : "tfjob-port"
}
]
# "resources": {
# "limits" : {
# "nvidia.com/gpu": 1
# }
# }
}
]
}
}
}
}
}
}
# Configure parameters for the Trial template.
trial_template = V1beta1TrialTemplate(
primary_container_name="tensorflow",
trial_parameters=[
V1beta1TrialParameterSpec(
name="batchSize",
description="batch size",
reference="batch_size"
),
V1beta1TrialParameterSpec(
name="learningRate",
description="Learning rate",
reference="learning_rate"
),
],
trial_spec=trial_spec
)
# Create an Experiment from the above parameters.
experiment_spec = V1beta1ExperimentSpec(
max_trial_count=max_trial_count,
max_failed_trial_count=max_failed_trial_count,
parallel_trial_count=parallel_trial_count,
objective=objective,
algorithm=algorithm,
parameters=parameters,
trial_template=trial_template
)
experiment_name = args.experiment_name
experiment_namespace = args.experiment_namespace
logger.info("Creating Experiment: {} in namespace: {}".format(experiment_name, experiment_namespace))
# Create Experiment object.
experiment = V1beta1Experiment(
api_version="kubeflow.org/v1beta1",
kind="Experiment",
metadata=V1ObjectMeta(
name=experiment_name,
namespace=experiment_namespace
),
spec=experiment_spec
)
logger.info("Experiment Spec : " + str(experiment_spec))
logger.info("Experiment: " + str(experiment))
# Create Katib client.
katib_client = KatibClient()
# Create Experiment in Kubernetes cluster.
output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
# Wait until Experiment is created.
end_time = datetime.datetime.now() + datetime.timedelta(minutes=60)
while True:
current_status = None
# Try to get Experiment status.
try:
current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
except Exception:
logger.info("Waiting until Experiment is created...")
# If current status is set, exit the loop.
if current_status is not None:
break
# If the timeout has been reached, raise an exception.
if datetime.datetime.now() > end_time:
raise Exception("Timeout waiting for Experiment: {} in namespace: {} to be created".format(
experiment_name, experiment_namespace))
time.sleep(1)
logger.info("Experiment is created")
# Wait for Experiment finish.
wait_experiment_finish(katib_client, experiment, args.experiment_timeout_minutes)
# Check if Experiment is successful.
if katib_client.is_experiment_succeeded(name=experiment_name, namespace=experiment_namespace):
logger.info("Experiment: {} in namespace: {} is successful".format(
experiment_name, experiment_namespace))
optimal_hp = katib_client.get_optimal_hyperparameters(
name=experiment_name, namespace=experiment_namespace)
logger.info("Optimal hyperparameters:\n{}".format(optimal_hp))
# # Create dir if it doesn't exist.
# if not os.path.exists(os.path.dirname("output.txt")):
# os.makedirs(os.path.dirname("output.txt"))
# Save HyperParameters to the file.
with open("output.txt", 'w') as f:
f.write(json.dumps(optimal_hp))
else:
logger.info("Experiment: {} in namespace: {} is failed".format(
experiment_name, experiment_namespace))
# Print Experiment if it is failed.
experiment = katib_client.get_experiment(name=experiment_name, namespace=experiment_namespace)
logger.info(experiment)
# Delete Experiment if it is needed.
if args.delete_after_done:
katib_client.delete_experiment(name=experiment_name, namespace=experiment_namespace)
logger.info("Experiment: {} in namespace: {} has been deleted".format(
experiment_name, experiment_namespace))
write_time("hyper_parameter_tuning_end", args.time_loc)
Dockerfile
FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
# installing packages
RUN pip install pandas
RUN pip install gcsfs
RUN pip install google-cloud-storage
RUN pip install pytz
RUN pip install kubernetes
RUN pip install kubeflow-katib
# copy the launcher code into the image
RUN mkdir /hp_tune
COPY task.py /hp_tune
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /hp_tune/prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/hp_tune/prj-vertex-ai-2c390f7e8fec.json"
# entry point
ENTRYPOINT ["python3", "/hp_tune/task.py"]
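For reference, the image is built and pushed in the usual way; a sketch with a placeholder project ID (the real one is redacted in this issue):
docker build -t gcr.io/<PROJECT_ID>/hptunekatibclient:v7 .
docker push gcr.io/<PROJECT_ID>/hptunekatibclient:v7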
gcr.io/.............../hptunekatib:v7
# import os
# os.system("pip install tensorflow-gpu==2.8.0")
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import os
from tensorflow.keras.layers import Conv1D, MaxPool1D ,Embedding ,concatenate
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense,Input
from tensorflow.keras.models import Model
from tensorflow import keras
from datetime import datetime
from pytz import timezone
from sklearn.model_selection import train_test_split
import pandas as pd
from google.cloud import storage
import argparse
import logging
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)
logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))
def format_strs(x):
strs = ""
if x > 0:
sign_t = "+"
strs += "+"
else:
sign_t = "-"
strs += "-"
strs = strs + "{:.1e}".format(x)
if "+" in strs[1:]:
sign = "+"
strs = strs[1:].split("+")
else:
sign = "-"
strs = strs[1:].split("-")
last_d = strs[1][1:] if strs[1][0] == "0" else strs[1]
strs_f = sign_t + strs[0] + sign + last_d
return strs_f
def get_args():
'''Parses args. Must include all hyperparameters you want to tune.'''
parser = argparse.ArgumentParser()
parser.add_argument(
'--learning_rate',
required=True,
type=float,
help='learning_rate')
parser.add_argument(
'--batch_size',
required=True,
type=int,
help='batch_size')
parser.add_argument(
'--model_uri',
required=True,
type=str,
help='Model Uri')
args = parser.parse_args()
return args
def download_blob(bucket_name, source_blob_name, destination_file_name):
"""Downloads a blob from the bucket."""
# The ID of your GCS bucket
# bucket_name = "your-bucket-name"
# The ID of your GCS object
# source_blob_name = "storage-object-name"
# The path to which the file should be downloaded
# destination_file_name = "local/path/to/file"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
# Construct a client side representation of a blob.
# Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
# any content from Google Cloud Storage. As we don't need additional data,
# using `Bucket.blob` is preferred here.
blob = bucket.blob(source_blob_name)
blob.download_to_filename(destination_file_name)
def create_dataset():
download_blob("faris_bucket_us_central", "Pipeline_data/input_dataset/dbpedia_model/data/" + "train.csv", "train.csv")
trainData = pd.read_csv('train.csv')
trainData.columns = ['label','title','description']
# trainData = trainData.sample(frac=0.002)
X_train, X_test, y_train, y_test = train_test_split(trainData['description'], trainData['label'], stratify=trainData['label'], test_size=0.1, random_state=0)
return X_train, X_test, y_train, y_test
def train_model(train_X, train_y, test_X, test_y, learning_rate, batch_size):
logger.info("Training with lr = " + str(learning_rate) + "bs = " + str(batch_size))
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/2", trainable=False)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l = tf.keras.layers.Dropout(0.2, name="dropout")(outputs['pooled_output']) # dropout_rate
l = tf.keras.layers.Dense(14,activation='softmax',kernel_initializer=tf.keras.initializers.GlorotNormal(seed=24))(l) # dense_units
model = Model(inputs=[text_input], outputs=l)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(train_X, train_y, epochs=5, validation_data=(test_X, test_y), batch_size=batch_size)
return model, history
def main():
args = get_args()
logger.info("Creating dataset")
train_X, test_X, train_y, test_y = create_dataset()
# one_hot_encoding the class label
encoder = LabelEncoder()
encoder.fit(train_y)
y_train_encoded = encoder.transform(train_y)
y_test_encoded = encoder.transform(test_y)
y_train_ohe = tf.keras.utils.to_categorical(y_train_encoded)
y_test_ohe = tf.keras.utils.to_categorical(y_test_encoded)
logger.info("Training model")
model = train_model(
train_X,
y_train_ohe,
test_X,
y_test_ohe,
args.learning_rate,
int(float(args.batch_size))
)
logger.info("Saving model")
artifact_filename = 'saved_model'
local_path = artifact_filename
tf.saved_model.save(model[0], local_path)
# Upload model artifact to Cloud Storage
model_directory = args.model_uri + "-".join(os.environ["HOSTNAME"].split("-")[:-2]) + "/"
local_path = "saved_model/assets/vocab.txt"
storage_path = os.path.join(model_directory, "assets/vocab.txt")
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
local_path = "saved_model/variables/variables.data-00000-of-00001"
storage_path = os.path.join(model_directory, "variables/variables.data-00000-of-00001")
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
local_path = "saved_model/variables/variables.index"
storage_path = os.path.join(model_directory, "variables/variables.index")
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
local_path = "saved_model/saved_model.pb"
storage_path = os.path.join(model_directory, "saved_model.pb")
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
logger.info("Model Saved at " + model_directory)
logger.info("Keras Score: " + str(model[1].history["accuracy"][-1]))
hp_metric = model[1].history["accuracy"][-1]
print("accuracy =", format_strs(hp_metric))
if __name__ == "__main__":
main()
Dockerfile
# FROM gcr.io/deeplearning-platform-release/tf-cpu.2-8
FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
RUN mkdir -p /opt/trainer
# RUN pip install scikit-learn
RUN pip install tensorflow_text==2.8.1
# RUN pip install tensorflow-gpu==2.8.0
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
COPY *.py /opt/trainer/
# # RUN chgrp -R 0 /opt/trainer && chmod -R g+rwX /opt/trainer
# RUN chmod -R 777 /home/trainer
ENTRYPOINT ["python", "/opt/trainer/task.py"]
# Sets up the entry point to invoke the trainer.
# ENTRYPOINT ["python", "-m", "trainer.task"]
The pipeline runs, but it does not use the GPU, and this piece of code
logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))
returns an empty list and an empty string; it is as if the GPU does not exist. I am attaching the container logs:
All five entries below are from the same pod (cluster kubeflow-pipelines-standalone-v2, zone us-central1-a, namespace kubeflow, pod dbpedia-exp-1-ntq7tfvj-ps-0, container tensorflow, TFJob dbpedia-exp-1-ntq7tfvj, replica-type ps, operator tfjob-controller, node gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k, project prj-vertex-ai):

timestamp | severity | stream | textPayload
-- | -- | -- | --
2022-07-11T06:07:30.812554270Z | INFO | stdout | accuracy = +9.9e-1
2022-07-11T06:07:30.812527036Z | ERROR | stderr | INFO:root:Num GPUs Available: []
2022-07-11T06:07:30.812519914Z | ERROR | stderr | 2022-07-11 06:07:30.811609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (dbpedia-exp-1-ntq7tfvj-ps-0): /proc/driver/nvidia/version does not exist
2022-07-11T06:07:30.812511863Z | ERROR | stderr | 2022-07-11 06:07:30.811541: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
 | ERROR | stderr | 2022-07-11 06:07:30.811461: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
What did you expect to happen:
I expected the pipeline stage to use the GPU and run the text classification on it, but it does not.
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
- Katib version (check the Katib controller image version): v0.13.0
- Kubernetes version (kubectl version): 1.22.8-gke.202
- OS (uname -a): Linux (Container-Optimized OS)
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
The problem is with the image that you have created. It is not with Katib. Did you use GPU drivers in the image?
I am able to execute "nvidia-smi" in the image and get the correct output. For this to happen, shouldn't the drivers be installed in the image? Just to be sure, can you provide me with details on how to use GPU drivers in the image?
You can use Nvidia NGC containers based on your framework https://catalog.ngc.nvidia.com/containers
I have tried using the Nvidia NGC containers as mentioned below
FROM nvcr.io/nvidia/tensorflow:22.06-tf2-py3
RUN mkdir -p /opt/trainer
RUN pip show tensorflow
RUN pip install pandas
RUN pip install scikit-learn
RUN pip install google-cloud-storage
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
COPY *.py /opt/trainer/
ENTRYPOINT ["python", "/opt/trainer/task.py"]
PS: I have used the same image in both containers of my pipeline, but I am still getting this problem.

Also, I have a question. I am setting a GPU limit on my pipeline component using .set_gpu_limit(1), as given below.
hp_tune = dsl.ContainerOp(
name='hp-tune-katib',
image=hyper_image_uri,
command=["python3", "/hp_tune/task.py"],
arguments=[
'--experiment_name', experiment_name,
'--experiment_namespace', experiment_namespace,
'--experiment_timeout_minutes', experiment_timeout_minutes,
'--delete_after_done', True,
'--hyper_image_uri', hyper_image_uri_train,
'--time_loc', time_loc,
'--model_uri', model_uri
],
file_outputs={'best-params': '/output.txt'}
).set_gpu_limit(1)
and the ARGO_CONTAINER is showing nvidia.com/gpu : 1

So my question is: do I need to specify a GPU request in my Katib trial spec as well, like below?
trial_spec = {
"apiVersion": "kubeflow.org/v1",
"kind": "TFJob",
"spec": {
"tfReplicaSpecs": {
"PS": {
"replicas": 1,
"restartPolicy": "Never",
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false",
}
},
"spec": {
"containers": [
{
"name": "tensorflow",
"image": args.hyper_image_uri,
"command": [
"python",
"/opt/trainer/task.py",
"--model_uri=" + args.model_uri,
"--batch_size=${trialParameters.batchSize}",
"--learning_rate=${trialParameters.learningRate}"
],
"ports" : [
{
"containerPort": 2222,
"name" : "tfjob-port"
}
]
}
]
}
}
},
"Worker": {
"replicas": 1,
"restartPolicy": "Never",
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false",
}
},
"spec": {
"containers": [
{
"name": "tensorflow",
"image": args.hyper_image_uri,
"command": [
"python",
"/opt/trainer/task.py",
"--model_uri=" + args.model_uri,
"--batch_size=${trialParameters.batchSize}",
"--learning_rate=${trialParameters.learningRate}"
],
"ports" : [
{
"containerPort": 2222,
"name" : "tfjob-port"
}
],
"resources" : {
"limits" : {
"nvidia.com/gpu" : 1
}
}
}
]
}
}
}
}
}
}
Also, I kindly request you to help me solve this GPU usage problem.
I haven't tried a GPU limit with Pipelines.
The easiest way is to check the Experiment YAML using kubectl. The trial spec does need the GPU limit if the trial pod needs to access the GPU.
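For instance (names taken from the Experiment created earlier in this thread; adjust to your own), the rendered Experiment, its trial TFJobs, and the GPU requests that actually land on the trial pods can be inspected with:
kubectl get experiment dbpedia-exp-1 -n kubeflow -o yaml
kubectl get tfjobs -n kubeflow
kubectl get pod <trial-pod-name> -n kubeflow -o yaml | grep -B 1 -A 2 nvidia.com/gpu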
This is what happens when I specify GPU request in the trial spec but not in the pipeline component.
This step is in Pending state with this message: Unschedulable: 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate.
This is my kubectl describe node
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia$ kubectl describe node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Name: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-highmem-8
beta.kubernetes.io/os=linux
cloud.google.com/gke-accelerator=nvidia-tesla-k80
cloud.google.com/gke-boot-disk=pd-standard
cloud.google.com/gke-container-runtime=containerd
cloud.google.com/gke-cpu-scaling-level=8
cloud.google.com/gke-max-pods-per-node=110
cloud.google.com/gke-nodepool=gpu-pool1
cloud.google.com/gke-os-distribution=cos
cloud.google.com/machine-family=n1
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-a
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-highmem-8
topology.gke.io/zone=us-central1-a
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-a
Annotations: container.googleapis.com/instance_id: 609271750101604849
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/prj-vertex-ai/zones/us-central1-a/instances/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j"}
node.alpha.kubernetes.io/ttl: 0
node.gke.io/last-applied-node-labels:
cloud.google.com/gke-accelerator=nvidia-tesla-k80,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=contai...
node.gke.io/last-applied-node-taints: nvidia.com/gpu=present:NoSchedule
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 15 Jul 2022 08:37:52 +0000
Taints: nvidia.com/gpu=present:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
AcquireTime: <unset>
RenewTime: Fri, 15 Jul 2022 08:52:28 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentContainerdRestart containerd is functioning properly
KernelDeadlock False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Fri, 15 Jul 2022 08:37:52 +0000 Fri, 15 Jul 2022 08:37:52 +0000 RouteCreated NodeController create implicit route
MemoryPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:52 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.14
ExternalIP: 34.171.4.196
InternalDNS: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Hostname: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 8
ephemeral-storage: 98868448Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 53477620Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 7910m
ephemeral-storage: 47093746742
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 48425204Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 27109359572b62f3c535daadb9e9c398
System UUID: 27109359-572b-62f3-c535-daadb9e9c398
Boot ID: cb1e0e37-2556-4f81-b0a8-b93a5105f484
Kernel Version: 5.10.90+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.4
Kubelet Version: v1.22.8-gke.202
Kube-Proxy Version: v1.22.8-gke.202
PodCIDR: 10.8.1.0/24
PodCIDRs: 10.8.1.0/24
ProviderID: gce://prj-vertex-ai/us-central1-a/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-kjmds 100m (1%) 0 (0%) 200Mi (0%) 500Mi (1%) 14m
kube-system gke-metrics-agent-zqm94 3m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system kube-proxy-gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j 100m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-driver-installer-hw2lx 150m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-gpu-device-plugin-ln587 50m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system pdcsi-node-2nlmc 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 14m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 413m (5%) 0 (0%)
memory 320Mi (0%) 700Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 14m kube-proxy
Normal Starting 14m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 14m kubelet Updated Node Allocatable limit across pods
Warning InvalidDiskCapacity 14m kubelet invalid capacity 0 on image filesystem
Normal NodeReady 14m kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeReady
Warning ContainerdStart 14m (x2 over 14m) systemd-monitor Starting containerd container runtime...
Warning DockerStart 14m (x3 over 14m) systemd-monitor Starting Docker Application Container Engine...
Warning KubeletStart 14m systemd-monitor Started Kubernetes kubelet.
Any idea how I can add a toleration for this taint and make the pod allocate the GPU?
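For reference, if the toleration had to be added by hand, it would go into the pod template spec of the trial; a minimal sketch against the trial_spec dict used in the launcher above (on GKE this toleration is normally injected automatically for pods that request nvidia.com/gpu, which is consistent with the pod YAML shown below):
# add a toleration for the GKE GPU taint nvidia.com/gpu=present:NoSchedule to the Worker pods
worker_pod_spec = trial_spec["spec"]["tfReplicaSpecs"]["Worker"]["template"]["spec"]
worker_pod_spec["tolerations"] = [
    {
        "key": "nvidia.com/gpu",
        "operator": "Equal",
        "value": "present",
        "effect": "NoSchedule",
    }
]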
This is my pod yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get pod dbpedia-exp-8-g4pvh4fc-worker-0 -o yaml -n kubeflow
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
annotations:
sidecar.istio.io/inject: "false"
creationTimestamp: "2022-07-15T09:57:26Z"
labels:
group-name: kubeflow.org
job-name: dbpedia-exp-8-g4pvh4fc
replica-index: "0"
replica-type: worker
training.kubeflow.org/job-name: dbpedia-exp-8-g4pvh4fc
training.kubeflow.org/job-role: master
training.kubeflow.org/operator-name: tfjob-controller
training.kubeflow.org/replica-index: "0"
training.kubeflow.org/replica-type: worker
name: dbpedia-exp-8-g4pvh4fc-worker-0
namespace: kubeflow
ownerReferences:
- apiVersion: kubeflow.org/v1
blockOwnerDeletion: true
controller: true
kind: TFJob
name: dbpedia-exp-8-g4pvh4fc
uid: 7401591a-e7f3-4036-823e-b63437fed795
resourceVersion: "39305"
uid: 5b974f29-4379-41ff-90dd-b51c6d04d189
spec:
containers:
- args:
- python /opt/trainer/task.py --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
--batch_size=32 --learning_rate=0.004570666890885507 1>/var/log/katib/metrics.log
2>&1 && echo completed > /var/log/katib/$$$$.pid
command:
- sh
- -c
env:
- name: TF_CONFIG
value: '{"cluster":{"ps":["dbpedia-exp-8-g4pvh4fc-ps-0.kubeflow.svc:2222"],"worker":["dbpedia-exp-8-g4pvh4fc-worker-0.kubeflow.svc:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}'
image: gcr.io/........./hptunekatib:v14
imagePullPolicy: IfNotPresent
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
- mountPath: /var/log/katib
name: metrics-volume
- args:
- -t
- dbpedia-exp-8-g4pvh4fc
- -m
- accuracy
- -o-type
- maximize
- -s-db
- katib-db-manager.kubeflow:6789
- -path
- /var/log/katib/metrics.log
image: docker.io/kubeflowkatib/file-metrics-collector:v0.13.0
imagePullPolicy: IfNotPresent
name: metrics-logger-and-collector
resources:
limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/log/katib
name: metrics-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
shareProcessNamespace: true
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- name: kube-api-access-xvtgc
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- emptyDir: {}
name: metrics-volume
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-07-15T09:57:26Z"
message: '0/2 nodes are available: 2 Insufficient nvidia.com/gpu.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
and this is my katib experiment yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get experiment dbpedia-exp-8 -o yaml -n kubeflow
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
creationTimestamp: "2022-07-15T09:57:05Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: dbpedia-exp-8
namespace: kubeflow
resourceVersion: "39293"
uid: ded49060-e00e-4b57-8fd1-f40af2ec162e
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 2
maxTrialCount: 2
metricsCollectorSpec:
collector:
kind: StdOut
objective:
metricStrategies:
- name: accuracy
value: max
objectiveMetricName: accuracy
type: maximize
parallelTrialCount: 1
parameters:
- feasibleSpace:
list:
- "32"
- "42"
- "52"
- "62"
- "64"
name: batch_size
parameterType: discrete
- feasibleSpace:
max: "0.005"
min: "0.001"
name: learning_rate
parameterType: double
resumePolicy: LongRunning
trialTemplate:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
primaryContainerName: tensorflow
primaryPodLabels:
training.kubeflow.org/job-role: master
successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
trialParameters:
- description: batch size
name: batchSize
reference: batch_size
- description: Learning rate
name: learningRate
reference: learning_rate
trialSpec:
apiVersion: kubeflow.org/v1
kind: TFJob
spec:
tfReplicaSpecs:
PS:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/............/hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
Worker:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/........./hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources:
limits:
nvidia.com/gpu: 1
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
status:
conditions:
- lastTransitionTime: "2022-07-15T09:57:05Z"
lastUpdateTime: "2022-07-15T09:57:05Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2022-07-15T09:57:26Z"
lastUpdateTime: "2022-07-15T09:57:26Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
observation: {}
runningTrialList:
- dbpedia-exp-8-g4pvh4fc
startTime: "2022-07-15T09:57:05Z"
trials: 1
trialsRunning: 1
Even though it shows Running, it will eventually time out.
What am I missing here?
This is not specific to Katib. It means that the trial could not find a node that satisfies these resource requirements to start the pod. One thing to note: when you add resource requirements to the trial spec, every trial pod will request the same set of resources when run in parallel. E.g., if the trialSpec requires 1 GPU and the experimentSpec allows 3 parallel trials, then each trial pod will request 1 GPU (3 GPUs in total).
Here is the gist of my working sample. You can ignore the node selector stuff; it just helps to schedule the pod on the GPU node I want (dedicated for training in my case):
trial_spec={
"apiVersion": "batch/v1",
"kind": "Job",
"spec": {
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false"
}
},
"spec": {
"affinity": {
"nodeAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": {
"nodeSelectorTerms": [
{
"matchExpressions": [
{
"key": "k8s.amazonaws.com/accelerator",
"operator": "In",
"values": [
"nvidia-tesla-v100"
]
},
{
"key": "ai-gpu-2",
"operator": "In",
"values": [
"true"
]
}
]
}
]
}
}
},
"containers": [
{
"resources" : {
"limits" : {
"nvidia.com/gpu" : 1
}
},
"name": training_container_name,
"image": "xxxxxxxxxxxxxxxxxxxxx__YOUR_IMAGE_HERE_xxxxxxxxxxxxxx",
"imagePullPolicy": "Always",
"command": train_params + [
"--learning_rate=${trialParameters.learning_rate}",
"--optimizer=${trialParameters.optimizer}",
"--batch_size=${trialParameters.batch_size}",
"--max_epochs=${trialParameters.max_epochs}"
]
}
],
"restartPolicy": "Never",
"serviceAccountName": "default-editor"
}
}
}
}
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Feel free to re-open an issue if you have any followup problems.