bitfusion-with-kubernetes-integration

Bitfusion server can't be reached

Open Goseokbin opened this issue 3 years ago • 7 comments

Hi. I'm trying to integrate Bitfusion with k8s. I installed the Bitfusion plugin successfully, but after creating a pod (option 1 YAML) I got an error: a "-" and a whitespace are prepended to the Bitfusion server host. How can I fix this? [Screenshot: Untitled]

Goseokbin avatar Jul 02 '21 09:07 Goseokbin

Also seeing this

pbarker avatar Jul 27 '21 00:07 pbarker

Hi, could you provide the Bitfusion Server version and the YAML file you used to create the pod?

4everming avatar Aug 05 '21 05:08 4everming

Can you show me the contents of client.yaml and servers.conf?

haozheng95 avatar Aug 05 '21 08:08 haozheng95

Hey @4everming and @haozheng95

client.yaml

version: ""
ssl: true
cacertpath: /etc/bitfusion/tls/ca.crt
keypath: /etc/bitfusion/tls/bitfusion.key
certpath: /etc/bitfusion/tls/bitfusion.crt
dbcertpath: /etc/bitfusion/tls/ca.crt
min-tls-version: ""
timeout: 3s
authconfig:
  enabled: true
  username: ""
  password: ""
  dbusername: ""
  dbpassword: ""
  token: <redacted>
network: {}
dispatcher:
  enabled: true
  rdma:
    rsocket: true
    hops: 2
  srs:
    auto_release_timeout: 10m
  cache_store:
    client_root: /root/.bitfusion/cache
    client_cleanup_threshold_MB: 5120
    server_root: /var/cache/bitfusion
    server_cleanup_threshold_MB: 5120

servers.conf

servers:
- addresses:
  - 10.202.122.248:5600

Bitfusion server version 3.5.0

The yaml file I used for the pod was https://github.com/vmware/bitfusion-with-kubernetes-integration/blob/main/bitfusion_device_plugin/example/pod.yaml

pbarker avatar Aug 09 '21 18:08 pbarker

Hey @pbarker, please make sure your servers.conf file content is correct. The Bitfusion server port should be 56001 instead of 5600. This is because the 2.5 client cannot resolve the servers.conf file.

Update the servers.conf file:

servers.conf

10.202.122.248:56001
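
Before resetting the Secret, it can help to sanity-check the address. The following is a minimal sketch, not part of the Bitfusion tooling: `parse_addr` and `check_reachable` are hypothetical helper names, and the reachability test assumes an `nc` build that supports `-z` and `-w`. It rejects entries such as `- 10.202.122.248:5600`, where a stray "- " ends up in front of the host.

```shell
# Hypothetical helper: split "host:port" and reject malformed entries.
parse_addr() {
  addr=$1
  case "$addr" in *:*) ;; *) return 1 ;; esac          # must contain a colon
  host=${addr%:*}
  port=${addr##*:}
  [ -n "$host" ] || return 1                           # host must not be empty
  case "$host" in *[[:space:]]*) return 1 ;; esac      # e.g. "- 10.202.122.248"
  case "$port" in ''|*[!0-9]*) return 1 ;; esac        # port must be numeric
  printf '%s %s\n' "$host" "$port"
}

# Hypothetical helper: try a TCP connection to the parsed address.
# Assumes an nc that supports -z (scan only) and -w (timeout in seconds).
check_reachable() {
  parsed=$(parse_addr "$1") || return 2
  set -- $parsed
  nc -z -w 3 "$1" "$2"
}

# Example usage (run from a machine that should reach the server):
#   check_reachable 10.202.122.248:56001 && echo reachable
```

A non-zero exit from `check_reachable` only says the TCP port did not answer within the timeout; it does not distinguish a wrong port from a firewall or routing problem.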

Reset the Secret

$ kubectl delete secret -n kube-system bitfusion-secret
$ kubectl delete secret -n tensorflow-benchmark bitfusion-secret
$ kubectl create secret generic bitfusion-secret --from-file=tokens -n kube-system

Then redeploy the pod: https://github.com/vmware/bitfusion-with-kubernetes-integration/blob/main/bitfusion_device_plugin/example/pod.yaml
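
If the Secret has to be reset more than once, the delete and create commands can be wrapped in a small function. This is only a sketch: `reset_secret` is a hypothetical name, the `KUBECTL` variable exists purely so the function can be dry-run with `echo`, and the thread does not say whether any namespace other than kube-system needs the Secret recreated by hand.

```shell
# Hypothetical wrapper around the reset commands from this thread.
# $1 = namespace, $2 = path to the tokens directory (default: tokens).
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
reset_secret() {
  ns=$1
  tokens=${2:-tokens}
  # --ignore-not-found keeps the delete from failing if the Secret is absent.
  "${KUBECTL:-kubectl}" delete secret bitfusion-secret -n "$ns" --ignore-not-found || return 1
  "${KUBECTL:-kubectl}" create secret generic bitfusion-secret --from-file="$tokens" -n "$ns"
}

# Example usage:
#   reset_secret kube-system tokens
#   kubectl get secret bitfusion-secret -n kube-system   # confirm it exists
```

Running it with `KUBECTL=echo` prints the two commands instead of executing them, which is a cheap way to check the namespace and path before touching the cluster.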

haozheng95 avatar Aug 10 '21 07:08 haozheng95

That appears to work, thanks @haozheng95! One oddity is when running the example pod I get:

python: can't open file '/benchmark/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py': [Errno 2] No such file or directory
exit status 2

I see that's being mounted from the hostPath, but the script doesn't appear to be there. Any idea what could be happening?

pbarker avatar Aug 12 '21 22:08 pbarker

@pbarker Run the following commands on the node where the pod is running:

  1. cd /home
  2. git clone https://github.com/tensorflow/benchmarks.git
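
The two steps above can be sketched as an idempotent check. `ensure_benchmarks` is a hypothetical helper, and it assumes the example pod mounts `/home/benchmarks` from the node at `/benchmark` in the container; verify the hostPath in your copy of pod.yaml before relying on that.

```shell
# Hypothetical helper: make sure the benchmark script exists on the node.
# $1 = parent directory of the checkout (default: /home, as in the steps above).
ensure_benchmarks() {
  root=${1:-/home}
  script="$root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
  if [ -f "$script" ]; then
    echo "found: $script"
    return 0
  fi
  # Clone only when the script is missing. git refuses to clone into a
  # non-empty directory, so a partial checkout has to be removed first.
  git clone https://github.com/tensorflow/benchmarks.git "$root/benchmarks" || return 1
  [ -f "$script" ]
}

# Example usage on the node where the pod is scheduled:
#   ensure_benchmarks /home
```

Because the files live on a hostPath, this has to be done on every node the pod may be scheduled to.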

haozheng95 avatar Sep 09 '21 06:09 haozheng95