dind-port-forward.sh -> invalid resource name ?

Open stock99 opened this issue 7 years ago • 5 comments

When I execute the script, I get an error similar to the one below:

root@ffdl2018:~/FfDL/bin# kubectl port-forward pod/$ui_pod $ui_port:8080
error: invalid resource name "pod/": [may not contain '/']

So I tried removing the pod/ prefix, thinking that the kubectl shipped with the newer version of kubeadm-dind might not accept it, but then I got a different error, shown below. Can someone help me with this error message?

Forwarding from 127.0.0.1:31300 -> 8080
Handling connection for 30029
E1031 14:22:28.129745 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:28 socat[11424] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:30.160553 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:30 socat[11441] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:32.191360 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:32 socat[11492] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:34.225286 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:34 socat[11493] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
creating data source...
Handling connection for 30029
set up dashboards
Handling connection for 30029
Finished

stock99 avatar Oct 31 '18 03:10 stock99

Hi @stock99, it looks like the script didn't find the right pod name in your Kubernetes cluster. Can you echo your pod names with the commands below? Thanks.

ui_pod=$(kubectl get pods | grep ffdl-ui | awk '{print $1}')
restapi_pod=$(kubectl get pods | grep ffdl-restapi | awk '{print $1}')
grafana_pod=$(kubectl get pods | grep prometheus | awk '{print $1}')

echo $ui_pod
echo $restapi_pod
echo $grafana_pod
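
If one of these lookups returns nothing, the argument pod/$ui_pod would expand to a bare pod/, which matches the resource name quoted in the error. A minimal sketch of a guard for that case follows; `require_pod` is a hypothetical helper, not part of the FfDL scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in dind-port-forward.sh): fail early when a pod
# lookup comes back empty, instead of handing an empty name to kubectl.
#   $1: the pattern that was searched for (used only in the message)
#   $2: the pod name returned by the lookup (may be empty)
require_pod() {
  local label="$1" name="$2"
  if [ -z "$name" ]; then
    echo "no pod found matching '$label'" >&2
    return 1
  fi
  echo "$name"
}

# Usage with the lookups above (needs a reachable cluster, so left commented):
# ui_pod=$(require_pod ffdl-ui "$(kubectl get pods | grep ffdl-ui | awk '{print $1}')") || exit 1
```

With this in place the script would stop with a readable message when a pod is missing, rather than surfacing a kubectl parsing error.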

Also, the pod/ format was introduced in kubectl client v1.10.0, so I recommend updating your kubectl client to v1.10.0 or later.
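
For a script that has to work with both older and newer clients, one option is to pick the target form from the client's minor version. This is only a sketch: `port_forward_target` is a hypothetical helper, the cutoff of 10 is taken from the v1.10.0 statement above, and it assumes the major version is 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the port-forward target form for a given client.
#   $1: kubectl client minor version, e.g. "8" or "12" (a suffix like "+" is ignored)
#   $2: pod name
port_forward_target() {
  local minor="${1%%[!0-9]*}" pod="$2"   # keep only the leading digits
  if [ "${minor:-0}" -ge 10 ]; then
    echo "pod/$pod"                      # resource/name form
  else
    echo "$pod"                          # bare pod name for older clients
  fi
}

# Usage (assumption: the client prints Minor:"N" as in the v1.8.15 output
# quoted in this thread; other kubectl releases may format this differently):
# minor=$(kubectl version --client | sed -n 's/.*Minor:"\([0-9]*\).*/\1/p')
# kubectl port-forward "$(port_forward_target "$minor" "$ui_pod")" "$ui_port":8080
```

Here `port_forward_target 8 ffdl-ui-b6cbb98f-c4zpm` yields the bare pod name, while a minor version of 12 yields the same name prefixed with pod/.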

Tomcli avatar Oct 31 '18 16:10 Tomcli

Hi Tomcli, it looks like the kubectl that comes with the kubeadm-dind installation script isn't the latest one (it is 1.8.x). Even though I installed the latest version via snap, the installation script still seems to enforce the use of 1.8.15. Should I adjust an environment variable?

echo $ui_pod
ffdl-ui-b6cbb98f-c4zpm
echo $restapi_pod
ffdl-restapi-84bcb74478-t8df6
echo $grafana_pod
prometheus-5f85fd7695-gb568

kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:59:56Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:52:15Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

snap list
Name     Version    Rev   Tracking  Publisher     Notes
aws-cli  1.15.71    135   stable    aws✓          classic
core     16-2.35.5  5742  stable    canonical✓    core
helm     2.11.0     63    stable    snapcrafters  classic
kubectl  1.12.1     462   stable    canonical✓    classic

stock99 avatar Nov 01 '18 00:11 stock99

Hi @stock99, I updated the script in #150 so that it runs with K8s 1.8.x. Let me know if you encounter any new issues.

Tomcli avatar Nov 01 '18 16:11 Tomcli

It seems to be OK now after removing 'pod/' in the script. The connection error in the opening post occurred because I mistyped one of the export statements in the DIND installation.

But then I got an error message from the test routine make test-push-data-s3 && make test-job-submit:

Getting all models ...
Handling connection for 32060
ID Name Framework Training status Submitted Completed

0 records found.
Makefile:213: recipe for target 'test-job-submit' failed
make: *** [test-job-submit] Error 1

The console log is attached: error_log.txt

stock99 avatar Nov 06 '18 01:11 stock99

Can anyone help? I got these error messages when running make test-job-submit:

Downloading Docker images and test training data. This may take a while.
Context "dind" modified.
error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
Submitting example training job (tf-model)
S3 URL: http://:30381 REST URL: http://localhost:31961
Executing in etc/examples/tf-model: DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux train manifest.yml .
sed: can't read : No such file or directory
name: tf_convolutional_network_tutorial
description: Convolutional network model using tensorflow
version: "1.0"
gpus: 0
cpus: 0.5
memory: 1Gb
learners: 1

# Object stores that allow the system to retrieve training data.
data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: tf_training_data
    training_results:
      container: tf_trained_model
    connection:
      auth_url: http://10.192.0.3:30417
      user_name: test
      password: test

framework:
  name: tensorflow
  version: "1.5.0-py3"
  command: >
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
      --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
      --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
      --trainingIters 2000
  # Change trainingIters to 20000 if you want your model to have over 80% Accuracy rate.

evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
  # (Eventual) Available event types: 'images', 'distributions', 'histograms', 'images'
  # 'audio', 'scalars', 'tensors', 'graph', 'meta_graph', 'run_metadata'
  #  event_types: [scalars]
/home/chris/FfDL/etc/examples/tf-model
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31961
Handling connection for 31961
FAILED
Error 200: OK

Test job submitted. Track the status via "DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux list".

chengboonrong avatar Apr 02 '19 04:04 chengboonrong