implement model-based power estimator
This PR introduces a dynamic way to estimate power via an Estimator class (pkg/model/estimator.go).
- the model is supposed to be dynamically downloaded to the folder `data/model`
- a python program runs as a child process and applies the trained model to the read values via a unix domain socket (a hypothetical sketch of the request format follows this list)
- the model class is implemented in python and currently supports `.h5` Keras models, `.sav` scikit-learn models, and a simple ratio model whose metric importance is computed by correlation to power
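For illustration, a minimal sketch of what the Go side of this exchange could look like, assuming a JSON-encoded request. The struct fields, JSON keys, and helper function below are hypothetical; the actual wire format is whatever pkg/model/estimator.go and estimator.py agree on.

```go
package model // hypothetical placement; included only to make the sketch self-contained

import (
	"encoding/json"
	"net"
)

// PowerRequest is an illustrative request shape, not the actual wire format.
type PowerRequest struct {
	ModelName  string      `json:"model_name"`
	Metrics    []string    `json:"metrics"` // feature names (xCols)
	Values     [][]float32 `json:"values"`  // per-pod feature values (xValues)
	CorePower  []float32   `json:"core_power"`
	DRAMPower  []float32   `json:"dram_power"`
	GPUPower   []float32   `json:"gpu_power"`
	OtherPower []float32   `json:"other_power"`
}

// sendRequest encodes the request as JSON over the unix domain socket and
// decodes the per-pod power values returned by the python estimator.
func sendRequest(req PowerRequest) ([]float32, error) {
	conn, err := net.Dial("unix", "/tmp/estimator.sock")
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return nil, err
	}
	var powers []float32
	if err := json.NewDecoder(conn).Decode(&powers); err != nil {
		return nil, err
	}
	return powers, nil
}
```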
There are three additional points needed to integrate this class into Kepler:
- initialize it in `exporter.go`

  ```go
  errCh := make(chan error)
  estimator := &model.Estimator{
  	Err: errCh,
  }
  // start the python program (pkg/model/py/estimator.py)
  // it will listen for PowerRequest on the unix domain socket "/tmp/estimator.sock"
  go estimator.StartPyEstimator()
  defer estimator.Destroy()
  ```
- call the `GetPower` function in `reader.go`

  ```go
  // it will create a PowerRequest and send it to estimator.py via the unix domain socket
  func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32
  ```
  - `modelName` refers to the model folder in `/data/model`, which contains a `metadata.json` giving the remaining details of the model such as the model file, feature engineering pkl files, features, error, and so on (the minimum-error model is auto-selected if it is empty, "")
  - `xCols` refers to the features
  - `xValues` refers to the values of each feature for each pod [no. pods x no. features]
  - `corePower` refers to the core power for each package (leave it empty if not available)
  - `dramPower`, `gpuPower`, and `otherPower` are the same as `corePower`
- put the initial models in the `data/model` folder of the container (this can be done by statically adding them to the docker image or via deployment manifest volumes)

Check the example usage in pkg/model/estimator_test.go; a hypothetical call sketch is shown below.
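As a rough usage sketch (the feature names and numeric values below are made up for illustration; the real example lives in estimator_test.go):

```go
// Hypothetical call to GetPower; feature names and values are illustrative only.
xCols := []string{"cpu_cycles", "cache_misses"} // features
xValues := [][]float32{                         // [no. pods x no. features]
	{12000, 350}, // pod A
	{8000, 120},  // pod B
}
corePower := []float32{35.0} // one value per package; leave empty if not available
dramPower := []float32{5.0}
gpuPower := []float32{}   // not available on this node
otherPower := []float32{}

// an empty modelName ("") auto-selects the minimum-error model under /data/model
podPowers := estimator.GetPower("", xCols, xValues, corePower, dramPower, gpuPower, otherPower)
_ = podPowers // podPowers[i] is the estimated power for pod i
```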
If you agree with this direction, we can modify estimator.py to
- support other modeling classes
- select the applicable features from the available features
- connect to kepler-model-server to update the model
Signed-off-by: Sunyanan Choochotkaew [email protected]
thank you @sunya-ch for this impressive work!
I wonder how much CPU and memory the estimator will consume, do you have any data?
That is great, I really want to have different Power Models, especially to bring back the Power Model based on Ratio.
> That is great, I really want to have different Power Models, especially to bring back the Power Model based on Ratio.
In the current implementation, I treat the ratio approach the same as the trained approach, considering it as just another model. This way, you can dynamically update the importance of the ratio metrics, for example when you find a more highly correlated metric. What do you think?
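For illustration only, the idea could look roughly like the sketch below (hypothetical Go code and names; the actual ratio model lives in estimator.py): measured total power is distributed to pods in proportion to a correlation-weighted combination of their metrics, and the weights can be updated like any other model artifact.

```go
// Hypothetical sketch: distribute a measured total power among pods in
// proportion to a correlation-weighted combination of their metrics.
// weights[j] is the importance of metric j (e.g. derived from its
// correlation to power) and can be updated dynamically like a model file.
func ratioPower(totalPower float32, weights []float32, podMetrics [][]float32) []float32 {
	scores := make([]float32, len(podMetrics))
	var sum float32
	for i, metrics := range podMetrics {
		for j, v := range metrics {
			scores[i] += weights[j] * v
		}
		sum += scores[i]
	}
	powers := make([]float32, len(podMetrics))
	if sum == 0 {
		return powers // avoid division by zero when no activity is observed
	}
	for i, s := range scores {
		powers[i] = totalPower * s / sum
	}
	return powers
}
```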
I'm wondering if, instead of calling the python code, we could have a micro-service running as a Power Model Server (which would run the python code) and access it through an API (http or grpc)... That way, we can enforce a good design pattern using APIs to communicate between different programming languages.
@rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.
> thank you @sunya-ch for this impressive work!
> I wonder how much CPU and memory the estimator will consume, do you have any data?
I have no experimental data yet. It passes the feature values through the unix domain socket and just applies the mathematical model to them for estimation. The training process is not included here.
We will also need some documentation describing how to configure the models, with details about the supported models.
> I'm wondering if, instead of calling the python code, we could have a micro-service running as a Power Model Server (which would run the python code) and access it through an API (http or grpc)... That way, we can enforce a good design pattern using APIs to communicate between different programming languages.
> @rootfs was creating an external server to do something like this, right? A server that receives some data, does some calculations, and responds with some information. This server could be a container that exposes an API to receive the data and return the energy consumption.
The external server is for training the model, and not all data should be sent to it. This module is called for prediction on every data read. I think it might be better to use a local socket instead of going through the network to a microservice.
If we need to use python, I will strongly argue for running it as a different service.
Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
> If we need to use python, I will strongly argue for running it as a different service.
> Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
I think at this step the unix domain socket should be fine, because it just applies the trained method (it is not going to do training or anything fancy other than read the trained weights and do multiplications on the read data). Migrating to golang is a big task and I don't think it would have that much impact. We can put it in the future work once everything is settled.
I will evaluate the end-to-end power estimation time per tick.
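To make that concrete, applying an already-trained linear model is essentially a handful of multiply-adds per pod. The sketch below is hypothetical Go code for illustration; the real inference runs in estimator.py using the scikit-learn/Keras model files.

```go
// Hypothetical sketch: inference with pre-trained weights is just
// multiplications and additions over the read feature values, no training.
func predict(weights []float32, bias float32, podFeatures [][]float32) []float32 {
	out := make([]float32, len(podFeatures))
	for i, features := range podFeatures {
		y := bias
		for j, x := range features {
			y += weights[j] * x
		}
		out[i] = y
	}
	return out
}
```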
> If we need to use python, I will strongly argue for running it as a different service.
> Otherwise, we could consider using a machine learning library in golang: https://upstack.co/knowledge/golang-machine-learning
+1 to that.
The estimator python may have its own repo and run as a sidecar, so we don't have to upgrade the kepler container image if the estimator changes.
I amended the commit by
- separating the estimator to build another Docker image that runs as a sidecar container. To build the image, run `make build-estimator`. The new image will be `quay.io/sustainable_computing_io/kepler-estimator:latest`
- changing the variable names of xCols and xValues.

TO-DO:
- [ ] measure overhead
- [ ] update reader.go to use GetPower
- [ ] add the sidecar container to the kepler deployment
sounds good, I just created a repo there for your next push: quay.io/sustainable_computing_io/kepler-estimator
These are the results from testing with the number of pods varied from 10 to 100 (as the maximum number of pods per worker node is about 100).
- general usage from pidstat with a 1-second interval of GetPower requests:
  - 0.04% MEM
  - VSZ ~3.8 Gb
  - RSS ~0.4 Gb (396764 KB)
```
# Time      UID      PID  %usr %system %guest %wait %CPU CPU minflt/s majflt/s     VSZ    RSS %MEM Command
21:41:27      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:28      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:29      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:30      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
21:41:31      0  2035757  0.00    0.00   0.00  0.00 0.00  22     0.09     0.00 3878724 396764 0.04 python
```
- elapsed time for handling a request, observed from estimator_client.py (~0.003 s for the scikit-learn model, ~0.020 s for the ratio model)
  - the ratio model invokes 10x more functions than applying a trained model (I believe it could be further optimized)
- profiled time (with python cProfile) from estimator_test.py
  - handling a request takes an extra 0.001s for finding the specific model name and 0.001s for common tasks
  - all trained models take almost the same elapsed time to get power, which is 0.001s
summary
- with a 1s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ≈3.8Gb, RSS≈0.4Gb
- the common communication latency is about 0.001s-0.002s and the corresponding model-finding latency is about 0.001s
- the computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001s

@rootfs what do you think?
> summary
> - with a 1s request interval, there is no significant CPU overhead from the estimator; for memory, VSZ≈3.8Gb, RSS≈0.4Gb
> - the common communication latency is about 0.001s-0.002s and the corresponding model-finding latency is about 0.001s
> - the computation latency is significantly affected by the model function; the general regressor model from scikit-learn takes about 0.001s
>
> @rootfs what do you think?
@sunya-ch thank you for this comprehensive study! These results are worth a doc of their own. Please add them to the PR as well.
Looking forward to your full integration.
the work is moved to kepler-estimator, closing