Open-Assistant
Observability additions
- Add new data sources (cadvisor for Docker containers, dcgm-exporter for NVIDIA GPUs)
- Add dashboards for Docker / GPUs
- Add mono-board (WIP) with variables for Datasource and Job
Not sure how to integrate the new containers into the deployment process; please let me know how that works.
Our deployment setup starts in https://github.com/LAION-AI/Open-Assistant/blob/main/.github/workflows/deploy-to-node.yaml, which references the files in https://github.com/LAION-AI/Open-Assistant/tree/main/ansible. deploy-to-node is the main entrypoint for web / backend, and in the inference subfolder you'll find the deployment code for the inference server and workers.
Do you know which Grafana version those dashboards use/need? I tried to copy them into Grafana v9.5.1 (bc353e4b2d) but seemed to be getting an error:
https://community.grafana.com/t/recurrent-issue-the-dashboards-has-been-changed-by-someone-else/41349/11
Not sure if it might be some other issue or not; I was hoping to just copy-paste in mono-board.json to see how it looks.
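If it helps narrow things down, a running Grafana reports its exact version over its health endpoint. A quick check might look like this (the host and port are assumptions here; adjust them to your instance):

```shell
# Query Grafana's /api/health endpoint and pull out the reported version.
# localhost:3000 is Grafana's default port, not necessarily this deployment's.
curl -s http://localhost:3000/api/health | grep -o '"version": *"[^"]*"'
```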
It'll be easier to just import it as a new dashboard rather than pasting it into an existing one. I was using a freshly pulled grafana/grafana Docker image.
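For reference, the import can also be scripted against Grafana's HTTP dashboard API, which wants the dashboard JSON wrapped in an envelope. A sketch, assuming Grafana at localhost:3000 with the default admin:admin credentials and jq available:

```shell
# Wrap mono-board.json in the envelope the dashboard API expects,
# then POST it as a new dashboard. URL and credentials are assumptions.
jq -n --slurpfile d mono-board.json \
   '{dashboard: $d[0], overwrite: false, folderId: 0}' |
curl -s -X POST -H 'Content-Type: application/json' \
     -d @- http://admin:admin@localhost:3000/api/dashboards/db
```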
I was trying to run in a codespace on this branch and got this:
docker compose --profile ci --profile observability up --build --attach-dependencies
[+] Running 11/0
✔ Container cadvisor Created 0.0s
✔ Container open-assistant-webdb-1 Running 0.0s
✔ Container netdata Running 0.0s
✔ Container prometheus Running 0.0s
✔ Container open-assistant-web-1 Created 0.0s
✔ Container open-assistant-maildev-1 Running 0.0s
✔ Container open-assistant-redis-1 Running 0.0s
✔ Container open-assistant-db-1 Running 0.0s
✔ Container open-assistant-backend-worker-1 Created 0.0s
✔ Container open-assistant-backend-1 Created 0.0s
✔ Container open-assistant-backend-worker-beat-1 Created 0.0s
Attaching to cadvisor, grafana, netdata, open-assistant-backend-1, open-assistant-backend-worker-1, open-assistant-backend-worker-beat-1, open-assistant-db-1, open-assistant-maildev-1, open-assistant-redis-1, open-assistant-web-1, open-assistant-webdb-1, prometheus
Error response from daemon: driver failed programming external connectivity on endpoint grafana (1020f4e9bb3ab4338326c4012ed943aa4189d5bf72be852f5225b8e5359915bb): Error starting userland proxy: listen tcp4 0.0.0.0:2000: bind: address already in use
Not sure what might already be using port 2000.
You can try following some of these to find what is using it: https://www.cyberciti.biz/faq/unix-linux-check-if-port-is-in-use-command/
Alternatively, change Open-Assistant/docker-compose.yaml:298 to - <free port in codespace>:2000
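On a Linux host, any of the following will usually reveal the listener on port 2000 (often it's a docker-proxy left over from a previous compose run); use whichever tool the host has installed:

```shell
# Three ways to find the process listening on TCP port 2000.
ss -tlnp | grep ':2000 '             # iproute2; -p shows pid/program
lsof -nP -iTCP:2000 -sTCP:LISTEN     # lsof variant
fuser -v 2000/tcp                    # psmisc variant; prints the owning PID
```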
I am hoping to get this merged so I can just use a GitHub Codespace to test this properly (just to make sure anyone can run it via a clean GitHub Codespace):
https://github.com/LAION-AI/Open-Assistant/pull/2970
This PR contains quite a lot of elements for which I don't see a clear use case. For example, why is there nvidia-gpus.json? We don't have any GPUs in our backend server.
In general we need a proper plan for monitoring. At the least, Grafana should be deployed to a separate machine to monitor the health of the system (and send out alerts).
@andreaskoepf We do need to discuss things properly.
I was in the process of writing proper Dockerfiles and updating the GitHub Actions / Ansible for all these new containers yesterday, but I had to pause that and was afk for the rest of the day.
I have added two new containers, cadvisor and dcgm-exporter.
- cadvisor allows monitoring of Docker containers (CPU, memory, network, etc.) and should be deployed on any server that is running Docker containers.
- dcgm-exporter gives stats for NVIDIA GPUs and should be deployed onto the inference worker servers so that the GPUs can be monitored for utilisation and VRAM usage (or anything else that would be useful). I mentioned this in the Discord:
Visibility is always good; if Prometheus is going to be scraping the GPU instances for the worker metrics anyway, we might as well pick up the GPU usage. That way, if something starts hanging, the GPUs can be checked, and when a new model is released it can be used to monitor the VRAM/utilisation for testing more optimised models.
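For anyone who wants to try the two exporters outside the compose file, they can be run standalone. The image names, mounts, and default metrics ports below are the common upstream defaults, not necessarily what this PR's compose file pins:

```shell
# cadvisor: container-level CPU/memory/network metrics, served on :8080.
docker run -d --name cadvisor -p 8080:8080 \
  -v /:/rootfs:ro -v /var/run:/var/run:ro \
  -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor

# dcgm-exporter: per-GPU utilisation/VRAM metrics, served on :9400.
# Requires the NVIDIA container runtime on the host.
docker run -d --name dcgm-exporter --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter
```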
I am closing this since we don't have a clear use case for it right now.