Open-Assistant
Observability additions
- Add new data sources (cadvisor for Docker containers, dcgm-exporter for NVIDIA GPUs)
- Add dashboards for Docker / GPUs
- Add mono-board (WIP) with variables for Datasource and Job
Not sure how to integrate the new containers into the deployment process; please let me know how that works.
Our deployment setup starts in https://github.com/LAION-AI/Open-Assistant/blob/main/.github/workflows/deploy-to-node.yaml, which references the files in https://github.com/LAION-AI/Open-Assistant/tree/main/ansible. deploy-to-node is the main entrypoint for web / backend, and in the inference subfolder you'll find the deployment code for the inference server and workers.
Do you know which Grafana version those dashboards use/need? I tried to copy them into Grafana v9.5.1 (bc353e4b2d) but seemed to be getting an error:
https://community.grafana.com/t/recurrent-issue-the-dashboards-has-been-changed-by-someone-else/41349/11
Not sure if it might be some other issue or not; I was hoping to just copy-paste in mono-board.json to see how it looks.
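If it helps narrow things down, a running Grafana reports its exact version over its health endpoint. A quick check might look like this (the host and port are assumptions here; adjust them to your instance):

```shell
# Query Grafana's /api/health endpoint and pull out the reported version.
# localhost:3000 is Grafana's default port, not necessarily this deployment's.
curl -s http://localhost:3000/api/health | grep -o '"version": *"[^"]*"'
```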
It'll be easier to just import it as a new dashboard rather than pasting it into an existing one. I was using a freshly pulled grafana/grafana Docker image.
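For reference, the import can also be scripted against Grafana's HTTP dashboard API, which wants the dashboard JSON wrapped in an envelope. A sketch, assuming Grafana at localhost:3000 with the default admin:admin credentials and jq available:

```shell
# Wrap mono-board.json in the envelope the dashboard API expects,
# then POST it as a new dashboard. URL and credentials are assumptions.
jq -n --slurpfile d mono-board.json \
   '{dashboard: $d[0], overwrite: false, folderId: 0}' |
curl -s -X POST -H 'Content-Type: application/json' \
     -d @- http://admin:admin@localhost:3000/api/dashboards/db
```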
I was trying to run in a codespace on this branch and got this:
docker compose --profile ci --profile observability up --build --attach-dependencies
[+] Running 11/0
✔ Container cadvisor Created 0.0s
✔ Container open-assistant-webdb-1 Running 0.0s
✔ Container netdata Running 0.0s
✔ Container prometheus Running 0.0s
✔ Container open-assistant-web-1 Created 0.0s
✔ Container open-assistant-maildev-1 Running 0.0s
✔ Container open-assistant-redis-1 Running 0.0s
✔ Container open-assistant-db-1 Running 0.0s
✔ Container open-assistant-backend-worker-1 Created 0.0s
✔ Container open-assistant-backend-1 Created 0.0s
✔ Container open-assistant-backend-worker-beat-1 Created 0.0s
Attaching to cadvisor, grafana, netdata, open-assistant-backend-1, open-assistant-backend-worker-1, open-assistant-backend-worker-beat-1, open-assistant-db-1, open-assistant-maildev-1, open-assistant-redis-1, open-assistant-web-1, open-assistant-webdb-1, prometheus
Error response from daemon: driver failed programming external connectivity on endpoint grafana (1020f4e9bb3ab4338326c4012ed943aa4189d5bf72be852f5225b8e5359915bb): Error starting userland proxy: listen tcp4 0.0.0.0:2000: bind: address already in use
Not sure what might already be using port 2000.
You can try following some of these to find what is using it: https://www.cyberciti.biz/faq/unix-linux-check-if-port-is-in-use-command/
Alternatively, change Open-Assistant/docker-compose.yaml:298 to - <free port in codespace>:2000
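On a Linux host, any of the following will usually reveal the listener on port 2000 (often it's a docker-proxy left over from a previous compose run); use whichever tool the host has installed:

```shell
# Three ways to find the process listening on TCP port 2000.
ss -tlnp | grep ':2000 '             # iproute2; -p shows pid/program
lsof -nP -iTCP:2000 -sTCP:LISTEN     # lsof variant
fuser -v 2000/tcp                    # psmisc variant; prints the owning PID
```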
I am hoping to get this merged so I can just use a GitHub Codespace to test this properly (just to make sure anyone can run it via a clean GitHub Codespace):
https://github.com/LAION-AI/Open-Assistant/pull/2970
This PR contains quite a lot of elements for which I don't see a clear use case. For example, why is there nvidia-gpus.json? We don't have any GPUs in our backend server.
In general we need a proper plan for monitoring. At the least, Grafana should be deployed to a separate machine to monitor the health of the system (and send out alerts).
@andreaskoepf We do need to discuss things properly.
I was in the process of writing proper Dockerfiles and updating the GitHub Actions / Ansible for all these new containers yesterday, but I had to pause that and was afk for the rest of the day.
I have added two new containers, cadvisor and dcgm-exporter.
- cadvisor allows monitoring of Docker containers (CPU, memory, network, etc.) and should be deployed on any server that is running Docker containers.
- dcgm-exporter gives stats for NVIDIA GPUs and should be deployed onto the inference worker servers so that the GPUs can be monitored for utilisation and VRAM usage (or anything else that would be useful). I mentioned this in the Discord:
Visibility is always good; if Prometheus is going to be scraping the GPU instances for the worker metrics anyway, we might as well pick up the GPU usage. That way, if something starts hanging, the GPUs can be checked, and when a new model is released it can be used to monitor the VRAM/utilisation for testing more optimised models.
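For anyone who wants to try the two exporters outside the compose file, they can be run standalone. The image names, mounts, and default metrics ports below are the common upstream defaults, not necessarily what this PR's compose file pins:

```shell
# cadvisor: container-level CPU/memory/network metrics, served on :8080.
docker run -d --name cadvisor -p 8080:8080 \
  -v /:/rootfs:ro -v /var/run:/var/run:ro \
  -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor

# dcgm-exporter: per-GPU utilisation/VRAM metrics, served on :9400.
# Requires the NVIDIA container runtime on the host.
docker run -d --name dcgm-exporter --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter
```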
I am closing this since we don't have a clear use case for it right now.