PaddleCloud
Use AI to replace rule-based autoscaling algorithm
@helinwang and I were talking with the PR folks at Baidu about our autoscaling feature, and he raised a good point: can we use AI to replace our current rule-based autoscaling algorithm? First of all, this feature is definitely not for this release and will need a lot of discussion. Based on this thought, I'm considering the following questions:
- Is this possible or doable? Can we gather enough data for this model? How are we going to measure whether a decision made by this model is a "good" decision?
- Do we need to extend our training-job YAML definition so that it exposes more information we can use as features to feed the model?
- When the refactoring is finished, can we use the protobuf of the computation graph as input to the model?
- Can we estimate the training time needed for a particular training job?
- Can we estimate and measure the work (computational effort) needed for a particular training job?
I don't have answers to the above questions yet, just some immature initial thoughts. I'm noting them down here for future reference and to "throw out a brick to attract jade" (offer rough ideas in the hope of drawing out better ones) 😉
I think reinforcement learning is perfect for this kind of task (planning).
Yes, it's possible, but it requires a lot of online cluster data.