bitfusion-with-kubernetes-integration
Bitfusion server can't be reached
Hi.
I'm trying to integrate Bitfusion with k8s.
I installed the Bitfusion plugin successfully,
but after creating a pod (option 1 YAML) I got an error.
A "-" and a space are prepended to the Bitfusion server host.
How can I fix this?
Also seeing this
Hi, could you provide the Bitfusion Server version and the YAML file you used to create the pod?
Can you show me the contents of client.yaml and servers.conf?
Hey @4everming and @haozheng95
client.yaml
version: ""
ssl: true
cacertpath: /etc/bitfusion/tls/ca.crt
keypath: /etc/bitfusion/tls/bitfusion.key
certpath: /etc/bitfusion/tls/bitfusion.crt
dbcertpath: /etc/bitfusion/tls/ca.crt
min-tls-version: ""
timeout: 3s
authconfig:
enabled: true
username: ""
password: ""
dbusername: ""
dbpassword: ""
token: <redacted>
network: {}
dispatcher:
enabled: true
rdma:
rsocket: true
hops: 2
srs:
auto_release_timeout: 10m
cache_store:
client_root: /root/.bitfusion/cache
client_cleanup_threshold_MB: 5120
server_root: /var/cache/bitfusion
server_cleanup_threshold_MB: 5120
servers.conf
servers:
- addresses:
- 10.202.122.248:5600
Bitfusion server version 3.5.0
The yaml file I used for the pod was https://github.com/vmware/bitfusion-with-kubernetes-integration/blob/main/bitfusion_device_plugin/example/pod.yaml
Hey @pbarker, please make sure the content of your servers.conf file is correct. The Bitfusion server port should be 56001 instead of 5600. This is because the 2.5 client cannot parse the servers.conf file.
Update the servers.conf file:
10.202.122.248:56001
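For anyone hitting the same problem, here is a minimal Python sketch of the flattening shown above: it pulls the host:port entries out of a YAML-style servers.conf and drops the leading "- " markers, so each address sits alone on its own line. The function name `flatten_servers_conf` and the address pattern are hypothetical, made up for illustration. The sketch assumes the older client expects one plain address per line, as the corrected file suggests. It does not change the port, so the 5600 to 56001 correction still has to be made by hand.

```python
import re

# Illustrative pattern for "host:port" (hostname or IPv4 address); not exhaustive.
ADDRESS_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.\-]*:\d+$")


def flatten_servers_conf(text):
    """Return the host:port entries found in a YAML-style servers.conf."""
    addresses = []
    for raw in text.splitlines():
        line = raw.strip()
        # Drop YAML list markers ("- ") in front of an entry.
        while line.startswith("- "):
            line = line[2:].strip()
        # Keys such as "servers:" or "addresses:" have no port and are skipped.
        if ADDRESS_RE.match(line):
            addresses.append(line)
    return addresses


if __name__ == "__main__":
    yaml_style = "servers:\n- addresses:\n  - 10.202.122.248:56001\n"
    # Print one plain address per line.
    print("\n".join(flatten_servers_conf(yaml_style)))
```

The printed lines can be written back to servers.conf before the Secret is recreated.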
Reset the Secret
$ kubectl delete secret -n kube-system bitfusion-secret
$ kubectl delete secret -n tensorflow-benchmark bitfusion-secret
$ kubectl create secret generic bitfusion-secret --from-file=tokens -n kube-system
Then redeploy the pod: https://github.com/vmware/bitfusion-with-kubernetes-integration/blob/main/bitfusion_device_plugin/example/pod.yaml
That appears to work, thanks @haozheng95! One oddity: when running the example pod I get:
python: can't open file '/benchmark/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py': [Errno 2] No such file or directory
exit status 2
I see that's being mounted from the hostPath, but the script doesn't appear to be there. Any idea what could be happening?
@pbarker Run the following commands on the node where the pod is running:
$ cd /home
$ git clone https://github.com/tensorflow/benchmarks.git
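To confirm the clone put the script where the pod expects it, a small check like the following can be run on the node. This is only a sketch. It assumes the example pod mounts /home/benchmarks from the node at /benchmark, which the thread implies but does not show. The helper name `benchmark_script_present` is hypothetical.

```python
from pathlib import Path

# Path of the benchmark script inside the tensorflow/benchmarks repository.
SCRIPT_RELATIVE = Path("scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py")


def benchmark_script_present(repo_root="/home/benchmarks"):
    """Return True if the benchmark script exists under repo_root.

    The default repo_root is an assumption: it is where `git clone`
    places the repository when run from /home.
    """
    return (Path(repo_root) / SCRIPT_RELATIVE).is_file()


if __name__ == "__main__":
    if benchmark_script_present():
        print("benchmark script found")
    else:
        print("benchmark script missing; clone the repository on this node")
```

If the check reports the script as missing, the pod was probably scheduled on a different node from the one where the repository was cloned.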