envd feat: Collect metrics in the environment and build monitor dashboard

Description

Support collecting related metrics (CPU/GPU/memory/disk) and displaying them to users.

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

Jul 26 '22 10:07 VoVAllen

I think we could build an exporter, but we need a good design. We should make it easy for users to analyze the data and see at a glance what they could do to improve efficiency or progress. Another thing I care about: should we also monitor the model or the computing flow in the applications users run?

Jul 26 '22 11:07 aseaday

Sure. We're considering something like a probe, that can also be integrated with the existing observability platform.

For the second question, what kind of monitoring are you referring to?

Jul 26 '22 16:07 VoVAllen

Let me state it clearly. For the first question, I think we had better also provide a simple all-in-one index, such as whether a user's computing device is busy or free. For the second question, I have developed some tools like Jaeger to trace the workflow of our machine learning applications. For an app running in envd, could we provide a Python agent or sidecar to detect the computing ops in some familiar frameworks? Or we could just offer a standard format and display the computing operator flow, while users develop the probe that finds the computing graph for the framework they are using.
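
To illustrate the "all-in-one index" idea, here is a minimal sketch. It is not envd's implementation: the `device_status` helper, the 0-100 utilization samples, and the 20% threshold are all assumptions made up for this example.

```python
from statistics import mean
from typing import Sequence


def device_status(utilization: Sequence[float], busy_threshold: float = 20.0) -> str:
    """Collapse recent utilization samples (percent, 0-100) into one word.

    Hypothetical helper: the threshold is an arbitrary illustrative value.
    """
    if not utilization:
        return "unknown"
    return "busy" if mean(utilization) >= busy_threshold else "free"


# Made-up samples, e.g. GPU utilization polled once per second.
print(device_status([85.0, 92.5, 78.0]))
print(device_status([1.0, 0.0, 3.5]))
```

Averaging over a short window rather than using a single sample avoids flipping between "busy" and "free" on momentary spikes.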

Jul 27 '22 02:07 aseaday

@aseaday I think generating a computation graph from a running application is still an open problem that cannot be solved in a non-intrusive way. An alternative is probably to use Nsight Systems to collect runtime metrics (we can provide an envd script for this), so that users can see the execution time of each operator. Does this work in your scenario?

Jul 27 '22 06:07 VoVAllen

Nsight works for me but seems a little complex for ordinary data scientists like our target audience. Maybe we should talk with our target users. What do they want to learn from a monitor?

Jul 27 '22 14:07 aseaday

Yep, we can. But I am not sure they can tell us. They may not know which metrics are helpful for them either.

Jul 28 '22 01:07 gaocegege

Some research: https://github.com/tensorchord/envd/issues/218

Jul 29 '22 03:07 gaocegege

> Some research: #218

An agent is better for general-purpose use.

Jul 29 '22 03:07 aseaday

I think we should have a better design for how to preprocess and show the metrics to users. Otherwise, just handing them a Grafana instance is not meaningful.

Jul 29 '22 03:07 aseaday

Yep, I think so. We need a proposal for the feature.

Jul 29 '22 03:07 gaocegege

I will write the proposal and build a basic exporter this week.

Aug 01 '22 15:08 aseaday

Why not build on https://github.com/prometheus/node_exporter?

Aug 09 '22 02:08 zwpaper

Good question. I have been thinking a lot about observability while working on the top command. We could divide the metrics tools into three types:

frontend: Grafana and top
intermediate layer: Prometheus or some simple in-memory storage
backend: collectors like exporters

An exporter is a good mechanism in a Prometheus environment. But for envd users, a hard problem is:

How much effort it costs to run and learn Grafana or Prometheus

Our users may only need a few simple indicators to quickly learn the processing status. So the hard part is not how to collect as many metrics as possible. It is about:

What metrics are important to envd users
How to show them easily

I hope this helps explain why I gave up on the exporter when writing top.
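
The three layers described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how envd's top is written: `collect`, `MemoryStore`, and `render` are hypothetical names, and the metrics are limited to what the Python standard library exposes.

```python
import os
import shutil
import time
from collections import deque
from typing import Deque, Dict, Optional


def collect() -> Dict[str, float]:
    """Backend: sample a few host metrics using only the standard library."""
    disk = shutil.disk_usage("/")
    sample = {"time": time.time(), "disk_used_pct": 100.0 * disk.used / disk.total}
    try:
        sample["load1"] = os.getloadavg()[0]  # 1-minute load average
    except (AttributeError, OSError):
        pass  # not available on every platform
    return sample


class MemoryStore:
    """Intermediate layer: keep only the most recent samples in memory."""

    def __init__(self, capacity: int = 60) -> None:
        self.samples: Deque[Dict[str, float]] = deque(maxlen=capacity)

    def add(self, sample: Dict[str, float]) -> None:
        self.samples.append(sample)

    def latest(self) -> Optional[Dict[str, float]]:
        return self.samples[-1] if self.samples else None


def render(store: MemoryStore) -> str:
    """Frontend: format the latest sample as one top-like status line."""
    latest = store.latest()
    if latest is None:
        return "no data"
    return "  ".join(f"{k}={v:.1f}" for k, v in sorted(latest.items()) if k != "time")


store = MemoryStore(capacity=60)
store.add(collect())
print(render(store))
```

Swapping the intermediate layer for Prometheus would change only where samples are stored, which is the sense in which the collector and the frontend stay independent.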

Aug 09 '22 03:08 aseaday

@aseaday thanks so much for the quick reply! I totally get your point about writing a brand new top for envd.

But the Prometheus stack also has some attractive advantages, like:

metrics history
a much more user-friendly UI
extensibility, with many exporters already available

The main concerns are that it may cost too many resources and that it has a learning curve, but:

We have a proposal for k8s, https://github.com/tensorchord/envd/pull/303; it may be a good choice if we have an existing Prometheus and Grafana to use.

What's more, we are considering observability, https://github.com/tensorchord/envd/issues/151, in which case long-term storage like Prometheus would be much more useful.

Aug 09 '22 03:08 zwpaper

I think prom works for us, though we may need our own exporter, because we may want to collect metrics beyond GPU/CPU hardware metrics.

So I am wondering whether we should introduce a separate daemon process in the container to act as the exporter.
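
As a rough sketch of what such an in-container exporter daemon could look like, the snippet below serves gauges in the Prometheus text exposition format over HTTP. Everything specific here is an assumption for illustration: the `envd_*` metric names, the `read_metrics` and `serve` helpers, and port 9877 are not part of envd.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict


def read_metrics() -> Dict[str, float]:
    """Collect gauges. The metric names are hypothetical."""
    disk = shutil.disk_usage("/")
    return {
        "envd_disk_total_bytes": float(disk.total),
        "envd_disk_used_bytes": float(disk.used),
    }


def render_exposition(metrics: Dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(read_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9877) -> None:
    """Run the exporter until interrupted. The port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_exposition({"envd_example_gauge": 1.0}), end="")
```

Calling `serve()` inside the container would expose `/metrics` for a Prometheus server to scrape. Custom metrics beyond hardware ones would be added in `read_metrics`.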

Aug 09 '22 03:08 gaocegege

@zwpaper @gaocegege Before we introduce prom into our user flow, the cost and method of bootstrapping prom should be considered. Should we start a prom daemon? And where should prom run: docker, runc, or k8s? We know our users may not develop apps on a server, and they may shut down their computers/laptops after work. That is also something prom may not be designed for.

As for the exporter, I support an exporter as a metrics collector inside the container for indicators that can't be collected from the container runtime endpoint. Otherwise, we should either merge the exporters from all containers and expose a unified exporter, or auto-register them with prom. But I am not sure the latter is possible for a runtime like k8s.

Aug 09 '22 03:08 aseaday

Actually, if we do not care about HA, and a dev env should be OK without it, prom can easily be run with one command. The resource cost would sometimes be a problem, but if we consider the use case where users run their envs in the cloud or share a bare-metal machine, the cost may not be that heavy.

As for the exporter, exporters are designed to be low cost, so it should be OK to run one in each container, and this would make the problem easy to solve.

PS, a brainstorm: should envd develop something like docker-compose to create multiple containers at a time 😂

Aug 09 '22 05:08 zwpaper

> we may need our own exporter.

@gaocegege totally agree, this is why I suggested building one based on node exporter.

Aug 09 '22 05:08 zwpaper

BTW, @aseaday although the discussion has focused on the prom part, top is great! It should always be one of the tools we present to users.

Just thinking about whether we can do something more.

Aug 09 '22 05:08 zwpaper

> BTW, @aseaday although the discussion has focused on the prom part, top is great! It should always be one of the tools we present to users.
>
> Just thinking about whether we can do something more.

The current framework is open to using prom. But we could add a prom daemon in bootstrap first.

Aug 09 '22 05:08 aseaday

I can handle the prom-related work, but we must get on the same page before starting. I will open a proposal later to describe and discuss how we do prom.

Aug 09 '22 06:08 zwpaper

@zwpaper Maybe we can discuss it further in discord to make sure that we are on the same page.

Aug 09 '22 06:08 gaocegege

I would recommend https://github.com/VictoriaMetrics/VictoriaMetrics as a way to store the metrics.

But I'm still not very sure if all the metrics we need can be stored in a way like Prometheus.

Aug 09 '22 06:08 kemingy

I think it should be push-based in envd. I do not know whether the prom Pushgateway is mature now.
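
For reference, a push-based flow with the Prometheus Pushgateway amounts to an HTTP PUT of text-format metrics to a `/metrics/job/<job>/instance/<instance>` path. The sketch below only builds such a request and does not send it. The gateway address, the job and instance names, and the `envd_gpu_busy` metric are assumptions for illustration, and the helper assumes names without a `/` in them.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, instance: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a Pushgateway PUT request.

    Assumes job and instance contain no '/' characters.
    """
    url = (
        f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical values: a local gateway and a made-up metric.
request = build_push_request("http://localhost:9091", "envd", "my-env", "envd_gpu_busy 1\n")
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=5)
```

A push model suits environments that come and go, such as a laptop that is shut down after work, because the environment reports when it is alive instead of waiting to be scraped.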

Aug 09 '22 07:08 gaocegege

@zwpaper Would you mind joining our Discord and discussing it in #envd-dev?

Aug 09 '22 07:08 gaocegege

https://discord.com/invite/KqswhpVgdU

Aug 09 '22 07:08 gaocegege

> PS, a brainstorm: should envd develop something like docker-compose to create multiple containers at a time 😂

It works if the "sidecar" container shares the same PID namespace with the envd containers. Then we can get the process info in it.
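
As a sketch of what "get the process info" could mean in practice: a sidecar that shares the PID namespace (for example via Docker's `--pid=container:<name>` option) can read `/proc/<pid>/stat` for the envd container's processes. The parser below is a hypothetical helper, and the sample line is made up so that the example runs anywhere, not only on Linux.

```python
from typing import Dict


def parse_proc_stat(line: str) -> Dict[str, object]:
    """Parse the leading fields of a Linux /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so we split on the last ')' instead of on spaces.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    rest = line[rparen + 2:].split()  # rest[0] is field 3 (state)
    return {
        "pid": int(line[:lparen].strip()),
        "comm": line[lparen + 1:rparen],
        "state": rest[0],
        "utime_ticks": int(rest[11]),  # field 14: user-mode CPU time
        "stime_ticks": int(rest[12]),  # field 15: kernel-mode CPU time
    }


# Made-up sample line for illustration.
sample = "4242 (python train.py) R 1 4242 4242 0 -1 4194304 100 0 0 0 1500 300 0 0 20 0 1 0 12345 1000000 250"
info = parse_proc_stat(sample)
print(info["comm"], info["utime_ticks"] + info["stime_ticks"])
```

In a real sidecar the line would come from `open(f"/proc/{pid}/stat").read()`, and CPU usage would be derived from the difference in ticks between two reads.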

Aug 09 '22 07:08 gaocegege

Also related to #13

Aug 09 '22 07:08 kemingy

Wow! @kemingy thanks so much for bringing up VictoriaMetrics/VictoriaMetrics. It really seems to be a great project and a good replacement for prom!

Aug 09 '22 09:08 zwpaper

This is the whole landscape I think envd needs. There are some points to watch out for:

As @kemingy said, time-series numeric data is not the only data to be collected. Data such as graphs describing the computing flow should also be considered.
I still think we could give users an analytic result that is simple to understand, such as that disk I/O is slow and is affecting overall performance, so we need an analytics feature too.
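
A very small rule-based version of such an analytics feature might look like the following. This is a sketch only: the metric names (`io_wait_pct`, `gpu_util_pct`, `memory_used_pct`) and every threshold are assumptions invented for the example, not measured or recommended values.

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Turn raw metrics into plain-language hints.

    Metric names and thresholds are hypothetical illustrations.
    """
    hints: List[str] = []
    io_wait = metrics.get("io_wait_pct", 0.0)
    gpu_util = metrics.get("gpu_util_pct", 100.0)
    if io_wait > 30.0 and gpu_util < 50.0:
        hints.append("Disk I/O looks slow while the GPU is mostly idle; data loading may be the bottleneck.")
    if metrics.get("memory_used_pct", 0.0) > 90.0:
        hints.append("Memory is nearly full.")
    return hints


for hint in diagnose({"io_wait_pct": 45.0, "gpu_util_pct": 10.0, "memory_used_pct": 60.0}):
    print(hint)
```

The point of the design is that users read a sentence instead of a dashboard. The rules themselves would need to come from talking to users about which situations matter to them.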

Aug 09 '22 10:08 aseaday
