
implement model-based power estimator

sunya-ch opened this issue 2 years ago · 14 comments

This PR introduces a dynamic way to estimate power via the Estimator class (pkg/model/estimator.go).

  • the model is expected to be downloaded dynamically to the data/model folder
  • a Python program runs as a child process and applies the trained model to the read values via a Unix domain socket
  • the model class is implemented in Python, currently supporting .h5 Keras models, .sav scikit-learn models, and a simple ratio model whose metric importance is computed from correlation to power

There are three additional points needed to integrate this class into Kepler:

  1. Initialize in exporter.go:
errCh := make(chan error)
estimator := &model.Estimator{
   Err: errCh,
}
// start the Python program (pkg/model/py/estimator.py);
// it will listen for PowerRequest on the Unix domain socket "/tmp/estimator.sock"
go estimator.StartPyEstimator()
defer estimator.Destroy()
  2. Call the GetPower function in reader.go:
// this creates a PowerRequest and sends it to estimator.py via the Unix domain socket
func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32
  • modelName refers to the model folder under /data/model, which contains metadata.json describing the rest of the model details, such as the model file, feature-engineering .pkl files, features, error, and so on (if it is empty, "", the minimum-error model is auto-selected; see the hypothetical metadata sketch after this list)
  • xCols refers to the feature names
  • xValues refers to the values of each feature for each pod [no. pods x no. features]
  • corePower refers to the core power of each package (leave it empty if not available)
  • dramPower, gpuPower, and otherPower are analogous to corePower
  3. Put initial models into the container's data/model folder (this can be done by statically adding them to the Docker image or via deployment manifest volumes).
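As a reference for the metadata.json mentioned above, a hypothetical Go struct for it; the actual keys in the model folder may differ:

// hypothetical shape of metadata.json; the real keys may differ
type ModelMetadata struct {
	ModelFile string   `json:"model_file"` // e.g. a .sav or .h5 file (assumed key)
	FEFiles   []string `json:"fe_files"`   // feature-engineering .pkl files (assumed key)
	Features  []string `json:"features"`   // expected feature order for xCols (assumed key)
	Error     float64  `json:"error"`      // used to auto-select the minimum-error model (assumed key)
}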

Check the example usage in pkg/model/estimator_test.go.
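Beyond the test file, a minimal sketch of a GetPower call; the feature names and values below are made up for illustration:

xCols := []string{"cpu_cycles", "cache_misses"}
xValues := [][]float32{ // 2 pods x 2 features
	{1.5e9, 2.0e6}, // hypothetical counters for pod A
	{0.7e9, 1.1e6}, // hypothetical counters for pod B
}
corePower := []float32{35.0} // one entry per package; leave empty if unavailable
dramPower := []float32{5.0}
// "" auto-selects the model with the minimum error
podPowers := estimator.GetPower("", xCols, xValues, corePower, dramPower, nil, nil)
// podPowers holds one estimated power value per pod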

If you agree with this direction, we can modify estimator.py to:

  • support other modeling classes
  • select the applicable features from the available features
  • connect to kepler-model-server to update the model

Signed-off-by: Sunyanan Choochotkaew [email protected]

sunya-ch · Aug 05 '22 13:08

Thank you @sunya-ch for this impressive work!

I wonder how much CPU and memory the estimator will consume; do you have any data?

rootfs · Aug 05 '22 13:08

That is great, I really want to have different power models, especially bringing back the ratio-based power model.

marceloamaral · Aug 05 '22 13:08

> That is great, I really want to have different power models, especially bringing back the ratio-based power model.

In the current implementation, I treat the ratio approach the same as the trained approach, considering it a model. This way, you can dynamically update the importance of the ratio metric, for example when you find a more highly correlated metric. What do you think?

sunya-ch · Aug 05 '22 14:08

I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) that we access through an API (HTTP or gRPC)... That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.

@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.
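For illustration only, a rough sketch of how a Go client could call such a server over HTTP; the endpoint, port, and payload schema here are invented for the example:

// a hypothetical HTTP alternative; endpoint, port, and payload schema are invented
// imports: bytes, encoding/json, net/http
func requestPowerHTTP(xCols []string, xValues [][]float32) ([]float32, error) {
	body, err := json.Marshal(map[string]interface{}{
		"x_cols":   xCols,
		"x_values": xValues,
	})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post("http://power-model-server:8100/power", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var powers []float32
	err = json.NewDecoder(resp.Body).Decode(&powers)
	return powers, err
}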

marceloamaral · Aug 05 '22 14:08

> Thank you @sunya-ch for this impressive work!
>
> I wonder how much CPU and memory the estimator will consume; do you have any data?

I have no experimental data yet. It passes the feature values through the Unix domain socket and just applies the mathematical model to them for estimation. The training process is not included here.

sunya-ch · Aug 05 '22 14:08

We will also need some documentation describing how to configure the models, with details about the supported models.

marceloamaral · Aug 05 '22 14:08

> I'm wondering if, instead of calling the Python code, we could have a micro-service running as a Power Model Server (which would run the Python code) that we access through an API (HTTP or gRPC)... That way, we can enforce a good design pattern, using APIs to communicate between different programming languages.
>
> @rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.

The external server is for training the model, and not all data should be sent to it. This module is called for prediction on every data read. I think it might be better to use a local socket instead of going through the network to a microservice.
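For comparison, a minimal sketch of the local-socket round trip from the Go side; the JSON wire format and request fields here are assumptions, not the actual PowerRequest protocol:

// a minimal sketch, assuming a JSON wire format; the real PowerRequest
// protocol between estimator.go and estimator.py may differ
// imports: encoding/json, net
type powerRequest struct {
	ModelName string      `json:"model_name"`
	XCols     []string    `json:"x_cols"`
	XValues   [][]float32 `json:"x_values"`
}

func requestPowerLocal(req powerRequest) ([]float32, error) {
	// local IPC over a filesystem socket: no network stack involved
	conn, err := net.Dial("unix", "/tmp/estimator.sock")
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return nil, err
	}
	var powers []float32
	err = json.NewDecoder(conn).Decode(&powers)
	return powers, err
}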

sunya-ch · Aug 05 '22 14:08

If we need to use Python, I will strongly argue for running it as a separate service.

Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

marceloamaral · Aug 05 '22 14:08

> If we need to use Python, I will strongly argue for running it as a separate service.
>
> Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

I think at this step, a Unix domain socket should be fine because it just applies the trained model (it is not going to do training or anything fancy beyond reading the trained weights and doing multiplications on the read data). Migrating to Golang is a big task, and I don't think it would have that much impact. We can put it in future work once everything is settled.
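To make "read the trained weights and do multiplications" concrete: for a linear model, the per-tick work is essentially one dot product per pod, along these lines (a sketch, not the actual estimator code):

// power[i] = intercept + weights · xValues[i]
func applyLinear(weights []float32, intercept float32, xValues [][]float32) []float32 {
	powers := make([]float32, len(xValues))
	for i, features := range xValues {
		p := intercept
		for j, v := range features {
			p += weights[j] * v // multiply each feature value by its trained weight
		}
		powers[i] = p
	}
	return powers
}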

I will evaluate the end-to-end power estimation time per tick.

sunya-ch · Aug 05 '22 14:08

> If we need to use Python, I will strongly argue for running it as a separate service.
>
> Otherwise, we could consider using a machine learning library in Golang: https://upstack.co/knowledge/golang-machine-learning

+1 that

The Python estimator may have its own repo and run as a sidecar, so we don't have to upgrade the kepler container image when the estimator changes.

rootfs · Aug 05 '22 15:08

I amended the commit by:

  • separating the estimator into another Docker image that runs as a sidecar container. To build the image, run make build-estimator. The new image will be quay.io/sustainable_computing_io/kepler-estimator:latest
  • renaming the xCols and xValues variables.

TO-DO:

  • [ ] measure overhead
  • [ ] update reader.go to use GetPower
  • [ ] add sidecar container to kepler deployment

sunya-ch · Aug 08 '22 11:08

Sounds good, I just created a repo there for your next push: quay.io/sustainable_computing_io/kepler-estimator

rootfs · Aug 08 '22 13:08

These are the results of testing with the number of pods varied from 10 to 100 (as the maximum number of pods per worker node is about 100).

  • general usage from pidstat, with GetPower requests at a 1-second interval:
    • 0.04% MEM
    • VSZ ~3.8 GB
    • RSS ~0.4 GB
# Time        UID       PID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
21:41:27        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:28        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:29        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:30        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
21:41:31        0   2035757    0.00    0.00    0.00    0.00    0.00    22      0.09      0.00 3878724  396764   0.04  python
  • elapsed time of handling a request, observed from estimator_client.py (~0.003 s for the scikit-learn model, ~0.020 s for the ratio model)
    • the ratio model invokes ~10x more function calls than applying the trained model (I believe it could be further optimized) [screenshot of profiling output]
  • profiled time (with Python cProfile) from estimator_test.py
    • handling a request takes an additional 0.001 s to find a specific model by name and 0.001 s for common tasks
    • all trained models take almost the same elapsed time to get power, which is 0.001 s [screenshots of cProfile output]

summary

  • with a 1 s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
  • common communication latency is about 0.001-0.002 s, and the corresponding model-finding latency is about 0.001 s
  • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s

@rootfs what do you think?

sunya-ch · Aug 10 '22 03:08

> summary
>
>   • with a 1 s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ ≈ 3.8 GB and RSS ≈ 0.4 GB
>   • common communication latency is about 0.001-0.002 s, and the corresponding model-finding latency is about 0.001 s
>   • computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001 s
>
> @rootfs what do you think?

@sunya-ch thank you for this comprehensive study! These results are worth a doc of their own. Please add them to the PR as well.

looking forward to your full integration

rootfs · Aug 10 '22 11:08

The work has moved to kepler-estimator; closing.

rootfs · Aug 31 '22 00:08