implement model-based power estimator
This PR introduces a dynamic way to estimate power via an Estimator class (pkg/model/estimator.go).
- the model is supposed to be dynamically downloaded to the folder `data/model`
- a python program runs as a child process and applies the trained model to the read values via a unix domain socket (a hypothetical sketch of the request format follows this list)
- the model class is implemented in python and currently supports `.h5` Keras models, `.sav` scikit-learn models, and a simple ratio model whose metric importance is computed by correlation to power
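For illustration, a minimal sketch of what the Go side of this exchange could look like, assuming a JSON-encoded request. The struct fields, JSON keys, and helper function below are hypothetical; the actual wire format is whatever pkg/model/estimator.go and estimator.py agree on.

```go
package model // hypothetical placement; included only to make the sketch self-contained

import (
	"encoding/json"
	"net"
)

// PowerRequest is an illustrative request shape, not the actual wire format.
type PowerRequest struct {
	ModelName  string      `json:"model_name"`
	Metrics    []string    `json:"metrics"` // feature names (xCols)
	Values     [][]float32 `json:"values"`  // per-pod feature values (xValues)
	CorePower  []float32   `json:"core_power"`
	DRAMPower  []float32   `json:"dram_power"`
	GPUPower   []float32   `json:"gpu_power"`
	OtherPower []float32   `json:"other_power"`
}

// sendRequest encodes the request as JSON over the unix domain socket and
// decodes the per-pod power values returned by the python estimator.
func sendRequest(req PowerRequest) ([]float32, error) {
	conn, err := net.Dial("unix", "/tmp/estimator.sock")
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return nil, err
	}
	var powers []float32
	if err := json.NewDecoder(conn).Decode(&powers); err != nil {
		return nil, err
	}
	return powers, nil
}
```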
There are three additional points needed to integrate this class into Kepler:
- initialize it in `exporter.go`

  ```go
  errCh := make(chan error)
  estimator := &model.Estimator{
  	Err: errCh,
  }
  // start the python program (pkg/model/py/estimator.py)
  // it will listen for PowerRequest on the unix domain socket "/tmp/estimator.sock"
  go estimator.StartPyEstimator()
  defer estimator.Destroy()
  ```
- call the `GetPower` function in `reader.go`

  ```go
  // it will create a PowerRequest and send it to estimator.py via the unix domain socket
  func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32
  ```
  - `modelName` refers to the model folder in `/data/model`, which contains a `metadata.json` giving the remaining details of the model such as the model file, feature engineering pkl files, features, error, and so on (the minimum-error model is auto-selected if it is empty, "")
  - `xCols` refers to the features
  - `xValues` refers to the values of each feature for each pod [no. pods x no. features]
  - `corePower` refers to the core power for each package (leave it empty if not available)
  - `dramPower`, `gpuPower`, and `otherPower` are the same as `corePower`
- put the initial models in the `data/model` folder of the container (this can be done by statically adding them to the docker image or via deployment manifest volumes)

Check the example usage in pkg/model/estimator_test.go; a hypothetical call sketch is shown below.
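As a rough usage sketch (the feature names and numeric values below are made up for illustration; the real example lives in estimator_test.go):

```go
// Hypothetical call to GetPower; feature names and values are illustrative only.
xCols := []string{"cpu_cycles", "cache_misses"} // features
xValues := [][]float32{                         // [no. pods x no. features]
	{12000, 350}, // pod A
	{8000, 120},  // pod B
}
corePower := []float32{35.0} // one value per package; leave empty if not available
dramPower := []float32{5.0}
gpuPower := []float32{}   // not available on this node
otherPower := []float32{}

// an empty modelName ("") auto-selects the minimum-error model under /data/model
podPowers := estimator.GetPower("", xCols, xValues, corePower, dramPower, gpuPower, otherPower)
_ = podPowers // podPowers[i] is the estimated power for pod i
```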
If you agree with this direction, we can modify estimator.py to
- support other modeling classes
- select the applicable features from the available features
- connect to kepler-model-server to update the model
Signed-off-by: Sunyanan Choochotkaew [email protected]
thank you @sunya-ch for this impressive work!
I wonder how much CPU and memory the estimator will consume, do you have any data?
That is great, I really want to have different Power Models, especially to bring back the Power Model based on Ratio.
> That is great, I really want to have different Power Models, especially to bring back the Power Model based on Ratio.
In the current implementation, I treat the ratio approach the same as the trained approach, considering it as just another model. This way, you can dynamically update the importance of the ratio metrics, for example when you find a more highly correlated metric. What do you think?
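For illustration only, the idea could look roughly like the sketch below (hypothetical Go code and names; the actual ratio model lives in estimator.py): measured total power is distributed to pods in proportion to a correlation-weighted combination of their metrics, and the weights can be updated like any other model artifact.

```go
// Hypothetical sketch: distribute a measured total power among pods in
// proportion to a correlation-weighted combination of their metrics.
// weights[j] is the importance of metric j (e.g. derived from its
// correlation to power) and can be updated dynamically like a model file.
func ratioPower(totalPower float32, weights []float32, podMetrics [][]float32) []float32 {
	scores := make([]float32, len(podMetrics))
	var sum float32
	for i, metrics := range podMetrics {
		for j, v := range metrics {
			scores[i] += weights[j] * v
		}
		sum += scores[i]
	}
	powers := make([]float32, len(podMetrics))
	if sum == 0 {
		return powers // avoid division by zero when no activity is observed
	}
	for i, s := range scores {
		powers[i] = totalPower * s / sum
	}
	return powers
}
```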
I'm wondering if, instead of calling the python code, we could have a micro-service running as a Power Model Server (which would run the python code) and access it through an API (http or grpc)... That way, we can enforce a good design pattern using APIs to communicate between different programming languages.
@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.
> thank you @sunya-ch for this impressive work!
> I wonder how much CPU and memory the estimator will consume, do you have any data?
I have no experimental data yet. It passes the feature values through the unix domain socket and just applies the mathematical model to them for estimation. The training process is not included here.
We will also need some documentation describing how to configure the models, with details about the supported models.
> I'm wondering if, instead of calling the python code, we could have a micro-service running as a Power Model Server (which would run the python code) and access it through an API (http or grpc)... That way, we can enforce a good design pattern using APIs to communicate between different programming languages.
> @rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.
The external server is for training the model, and not all data should be sent to it. This module is called for prediction on every data read. I think it might be better to use a local socket instead of going through the network to a microservice.
If we need to use python, I will strongly argue for running it as a different service.
Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
> If we need to use python, I will strongly argue for running it as a different service.
> Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
I think at this step the unix domain socket should be fine, because it just applies the trained method (it is not going to do training or anything fancy other than read the trained weights and do multiplications on the read data). Migrating to golang is a big task and I don't think it would have that much impact. We can put it in the future work once everything is settled.
I will evaluate the end-to-end power estimation time per tick.
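To make that concrete, applying an already-trained linear model is essentially a handful of multiply-adds per pod. The sketch below is hypothetical Go code for illustration; the real inference runs in estimator.py using the scikit-learn/Keras model files.

```go
// Hypothetical sketch: inference with pre-trained weights is just
// multiplications and additions over the read feature values, no training.
func predict(weights []float32, bias float32, podFeatures [][]float32) []float32 {
	out := make([]float32, len(podFeatures))
	for i, features := range podFeatures {
		y := bias
		for j, x := range features {
			y += weights[j] * x
		}
		out[i] = y
	}
	return out
}
```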
> If we need to use python, I will strongly argue for running it as a different service.
> Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
+1 to that.
The estimator python may have its own repo and run as a sidecar, so we don't have to upgrade the kepler container image if the estimator changes.
I amended the commit by
- separating the estimator to build another Docker image that runs as a sidecar container. To build the image, run `make build-estimator`. The new image will be `quay.io/sustainable_computing_io/kepler-estimator:latest`
- changing the variable names of xCols and xValues.

TO-DO:
- [ ] measure overhead
- [ ] update reader.go to use GetPower
- [ ] add the sidecar container to the kepler deployment
sounds good, I just created a repo there for your next push: quay.io/sustainable_computing_io/kepler-estimator
These are the results from testing with the number of pods varied from 10 to 100 (as the maximum number of pods per worker node is about 100).
- general usage from pidstat with a 1-second interval of GetPower requests:
  - 0.04% MEM
  - VSZ ~3.8 Gb
  - RSS ~0.4 Gb (396764 KB)
```
# Time      UID      PID  %usr %system %guest %wait %CPU CPU minflt/s majflt/s     VSZ    RSS %MEM Command
21:41:27      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:28      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:29      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:30      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:31      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
```
- elapsed time for handling a request, observed from estimator_client.py (~0.003 s for the scikit-learn model, ~0.020 s for the ratio model)
  - the ratio model invokes 10x more functions than applying a trained model (I believe it could be further optimized)
- profiled time (with python cProfile) from estimator_test.py
  - handling a request takes an extra 0.001s for finding the specific model name and 0.001s for common tasks
  - all trained models take almost the same elapsed time to get power, which is 0.001s
summary
- with a 1s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ≈3.8Gb, RSS≈0.4Gb
- the common communication latency is about 0.001s-0.002s and the corresponding model-finding latency is about 0.001s
- the computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001s

@rootfs what do you think?
> summary
> - with a 1s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ≈3.8Gb, RSS≈0.4Gb
> - the common communication latency is about 0.001s-0.002s and the corresponding model-finding latency is about 0.001s
> - the computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001s
>
> @rootfs what do you think?
@sunya-ch thank you for this comprehensive study! These results are worth a doc of their own. Please add them to the PR as well.
Looking forward to your full integration.
the work is moved to kepler-estimator, closing