Bahador Bakhshi
Different schedules can be used to decay alpha and epsilon, for example alpha = alpha0 / (1 + iteration * decay)
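The inverse-time schedule above can be sketched as a small helper; alpha0, epsilon0, and the decay constant here are assumed example values, not values from the notes.

```python
alpha0, epsilon0, decay = 0.5, 1.0, 0.01  # assumed hyperparameters

def decayed(value0, iteration, decay=decay):
    """Inverse-time decay: value0 / (1 + iteration * decay)."""
    return value0 / (1 + iteration * decay)

# Both the learning rate and the exploration rate shrink as training proceeds.
for iteration in (0, 10, 100):
    alpha = decayed(alpha0, iteration)
    epsilon = decayed(epsilon0, iteration)
```

The same helper works for epsilon; other common choices are exponential decay or a fixed step-wise reduction.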
If the consumer domain can choose to overcharge, it may sometimes prefer overcharging in order to keep resources available for other demands
Q(s, a) ← (1 − α) Q(s, a) + α [r + γ · max_{a′} f(Q(s′, a′), N(s′, a′))]
In this equation:
- N(s′, a′) counts the number of times action a′ has been tried in state s′
- f(q, n) is an exploration function that trades off the value estimate q against the visit count n, so that rarely tried actions look more attractive
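A minimal sketch of this update, assuming one common choice of exploration function, f(q, n) = q + k / (n + 1); the bonus constant k and the hyperparameters are assumptions for illustration.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) value estimates
N = defaultdict(int)     # N(s, a) visit counts
alpha, gamma, k = 0.1, 0.9, 1.0  # assumed hyperparameters

def f(q, n, k=k):
    """Exploration function: adds a bonus that shrinks with the visit count."""
    return q + k / (n + 1)

def update(s, a, r, s_next, actions):
    # Bootstrap target uses f(Q, N) over next actions instead of plain max Q.
    target = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    N[(s, a)] += 1
```

With all estimates at zero, the bonus k / (n + 1) alone drives the agent toward untried actions, which is the point of the exploration function.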
Instead of an MDP with Q-Learning / R-Learning, a contextual bandit formulation may also be applicable
Dyna needs far fewer episodes to converge. It seems that, in large problems, it is really beneficial to use it instead of direct Q-Learning
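A minimal Dyna-Q sketch showing why fewer real episodes are needed: each real transition also trains a learned model, which is then replayed for several planning updates. The number of planning steps and the hyperparameters are assumed example values.

```python
import random
from collections import defaultdict

random.seed(0)
Q = defaultdict(float)
model = {}                               # (s, a) -> (r, s_next), deterministic model
alpha, gamma, n_planning = 0.1, 0.9, 10  # assumed hyperparameters

def q_update(s, a, r, s_next, actions):
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)   # (1) direct RL update from real experience
    model[(s, a)] = (r, s_next)          # (2) model learning
    for _ in range(n_planning):          # (3) planning: replay simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next, actions)
```

One real transition here triggers 1 + n_planning value updates, which is where the sample-efficiency gain over direct Q-Learning comes from.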
Double learning can be applied to QL, SARSA, and Expected SARSA
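For the Q-Learning case, a minimal double Q-learning sketch: two tables are kept, one selects the greedy next action and the other evaluates it, which reduces maximization bias. Hyperparameters are assumed example values.

```python
import random
from collections import defaultdict

random.seed(0)
QA, QB = defaultdict(float), defaultdict(float)
alpha, gamma = 0.1, 0.9  # assumed hyperparameters

def double_q_update(s, a, r, s_next, actions):
    """Update one table at random; the other table evaluates the argmax."""
    if random.random() < 0.5:
        best = max(actions, key=lambda a2: QA[(s_next, a2)])
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, best)] - QA[(s, a)])
    else:
        best = max(actions, key=lambda a2: QB[(s_next, a2)])
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, best)] - QB[(s, a)])
```

The same decoupling of action selection from action evaluation carries over to SARSA and Expected SARSA by replacing the argmax target accordingly.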
Does Expected SARSA do better than QL?
Does SARSA do better than QL?
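To reason about these questions, it helps to compare the bootstrap targets of the three methods on a single transition; the epsilon-greedy policy, the Q(s′, ·) values, and the sampled next action are assumptions for illustration.

```python
gamma, epsilon = 0.9, 0.1
q_next = {'a0': 1.0, 'a1': 0.5}  # hypothetical Q(s', .) values
r = 0.0

# Q-Learning: max over next actions (off-policy, greedy target)
ql_target = r + gamma * max(q_next.values())

# SARSA: the next action actually sampled by the policy (here, say a1)
sarsa_target = r + gamma * q_next['a1']

# Expected SARSA: expectation over the epsilon-greedy policy, no sampling noise
n = len(q_next)
greedy = max(q_next, key=q_next.get)
probs = {a: epsilon / n + (1 - epsilon) * (a == greedy) for a in q_next}
exp_target = r + gamma * sum(probs[a] * q_next[a] for a in q_next)
```

Expected SARSA's target has the same mean as SARSA's but lower variance, which is the usual argument for why it can learn more stably than both SARSA and QL.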