Use curriculum and agent group at the same time
Is your feature request related to a problem? Please describe.
I made an asymmetric game with one ghost and one human: the ghost tries to catch the human, and the human needs to run away. The curriculum worked well in that setting. Now I have changed the number of humans to 2, and they need to work together to solve some missions, so I register them as a group. But I find that I cannot set the group reward as the `measure` of `completion_criteria` in the curriculum, and it seems nobody has asked this question before.
Is there any way to use the group reward as the `measure` now? If not, I have to keep using the agent reward as the measure, or control the curriculum myself.
Describe the solution you'd like
Method 1
Let users choose `group_reward` as the `measure`. Since groups are not defined in the config file, we can only choose a behavior as the target of the measure. An example is below:
```yaml
- name: MyFirstLesson
  completion_criteria:
    measure: group_reward # a new measure
    behavior: Human
    signal_smoothing: true
    min_lesson_length: 100
    threshold: 0.2
  value: 0.0
```
But there is a big problem with this method. Human agents can join different groups at the same time. For example, Human1 and Human2 join group A, Human3 and Human4 join group B, and Ghost1 joins group C. In this situation, the config above will take the mean group reward over all of Human1-4 as the measure, rather than only the Humans of group A, and there is still no way to choose only group A as the measure target. This is the same way TensorBoard shows group rewards (all Human agents share the mean group reward), so it might still be an acceptable way to measure group reward.
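To make the problem concrete, here is a small hypothetical illustration (the agent names and numbers are made up) of the difference between the per-behavior aggregation that the config above would measure and the per-group value I actually want:

```python
# Hypothetical group rewards received by each Human agent in one episode.
group_rewards = {
    "Human1": 0.9,  # group A
    "Human2": 0.9,  # group A
    "Human3": 0.1,  # group B
    "Human4": 0.1,  # group B
}

# What "measure: group_reward" with "behavior: Human" would report:
# the mean over every agent with the Human behavior, mixing groups A and B.
behavior_mean = sum(group_rewards.values()) / len(group_rewards)  # 0.5

# What I actually want as the completion measure: only group A.
group_a = [group_rewards["Human1"], group_rewards["Human2"]]
group_a_mean = sum(group_a) / len(group_a)  # 0.9
```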
Describe alternatives you've considered
Method 2
We could make the curriculum more flexible. For example, allow C# code to act as a completion criterion. If we have YAML like:
```yaml
environment_parameters:
  my_envpara:
    curriculum:
      - name: MyFirstLesson
        ...
```
and if users call `Academy.Instance.CompletionCriteria("my_envpara")`, then `my_envpara` moves to the next lesson. This lets users define the curriculum much more flexibly. However, it also requires giving C# code access to the mean reward of a behavior and the mean reward of an agent group, which might be a lot of work. With that ability, users could get the mean reward of group A and, once it is high enough, call `CompletionCriteria`.
Additional context
I know that we usually want the agents' rewards to be higher. But in some situations we may need a reward to go lower. For example:
```yaml
environment_parameters:
  my_envpara:
    curriculum:
      - name: MyLesson1
      - name: MyLesson2
      - name: MyLesson3
      ...
```
In Lesson1 I let the Ghost learn to catch Humans, so the mean reward of the Ghost goes higher. Once the Ghost's mean reward is high enough, training moves to Lesson2. In Lesson2, I change some environment settings to let the Humans learn to run away. The criterion for Lesson2 could be that the Ghost's mean reward is low enough, or that the Humans' mean reward is high enough. But there may be situations where I cannot use the Humans' reward as the criterion. What I want to express is that curriculum criteria need more flexibility, so users can build more complex curricula.
A bug report
There is also a bug in the curriculum. When I set the YAML as:
```yaml
- name: MyFirstLesson
  completion_criteria:
    measure: reward
    behavior: Ghost
    signal_smoothing: true
    min_lesson_length: 100
    threshold: -0.75
  value: 0.0
```
and I fix the reward of the Ghost at -1, the criterion is satisfied and training moves to the next lesson. But when I set the threshold to -0.74, it does not. For some reason, a reward α can satisfy a threshold of 0.75α, and α must be negative. When I fix the mean reward at 0, a threshold of 0 is not satisfied but a threshold of -0.99 is. So there is no problem when the mean reward is >= 0. If the mean reward is < 0, the criterion can be completed even though the reward is lower than the threshold. Is that a bug, or is something wrong with my setup?
Here are two real examples:
case1:
```yaml
course_stage:
  curriculum:
    - value:
        sampler_type: constant
        sampler_parameters:
          seed: 5277
          value: 0.0
      name: Lesson0 (Ghost learn to catch human, freeze human)
      completion_criteria:
        behavior: Ghost
        measure: reward
        min_lesson_length: 1000
        signal_smoothing: true
        threshold: -0.7499 # <- difference
        require_reset: false
```
Almost all of the Ghost's mean rewards are -1, and the threshold is -0.7499, yet it moves to the next lesson at step 20k, which is where `min_lesson_length` is first reached.
case2:
```yaml
curriculum:
  - value:
      sampler_type: constant
      sampler_parameters:
        seed: 4928
        value: 0.0
    name: Lesson0 (Ghost learn to catch human, freeze human)
    completion_criteria:
      behavior: Ghost
      measure: reward
      min_lesson_length: 1000
      signal_smoothing: true
      threshold: -0.74 # <- difference
      require_reset: false
```
All of the Ghost's mean rewards are -1 or even higher, and the threshold is -0.74, but it does not go to the next lesson until step 300k.
This bug prevents me from building a correct curriculum in which the Ghost trains until its mean reward is >= 0.8 and then moves to the next lesson.
It is because of the code here. The initial value of smoothing is 0 and the initial value of my measure is -1, so 0.25*0 + 0.75*(-1) = -0.75: the value becomes larger (closer to zero) and the lesson advances. If the initial value of my measure were 1, then 0.25*0 + 0.75*1 = 0.75: the value becomes smaller, so there is no bug in that case. That is why only negative values trigger this bug. I think there is no need to use smoothing at all, since the measure is already a mean value, so you could simply remove the smoothing option, or change the way its initial value is set.
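To make the mechanism concrete, here is a minimal Python sketch of the smoothing update described above. The `advance_lesson` helper and the strict `>` comparison are assumptions for illustration, not the actual ml-agents code; only the 0.25/0.75 weighting and the zero initial value are taken from the behavior described above:

```python
# Minimal sketch: 0.25 weight on the previous smoothed value, 0.75 weight on
# the new measurement, and a smoothed value that starts at 0.
def advance_lesson(measure, threshold, smoothing=0.0):
    smoothed = 0.25 * smoothing + 0.75 * measure
    return smoothed > threshold, smoothed

# Mean reward fixed at -1: the very first smoothed value is 0.75 * (-1) = -0.75,
# so any threshold below -0.75 (e.g. -0.76) is crossed immediately, even though
# the true mean reward (-1) never reached it.
print(advance_lesson(-1.0, -0.76))  # (True, -0.75)

# Mean reward fixed at +1: the first smoothed value is only 0.75, so a positive
# threshold such as 0.8 is (correctly) not reached yet.
print(advance_lesson(1.0, 0.8))  # (False, 0.75)
```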
Thanks for raising this feature request. This is something on our roadmap since we also received requests for ELO as completion criteria previously. We have been re-thinking the interface design and the goal would be providing a more flexible interface to specify completion criteria with any training stats. I will update when we have the feature ready. Logged internally as MLA-2115.
As for the bug report you mentioned, this is the expected outcome: when you enable smoothing, the lesson update won't happen at the exact time you first reach the threshold value. You can easily turn it off by setting `signal_smoothing: false`.
> Thanks for raising this feature request. This is something on our roadmap since we also received requests for ELO as completion criteria previously. We have been re-thinking the interface design and the goal would be providing a more flexible interface to specify completion criteria with any training stats. I will update when we have the feature ready. Logged internally as MLA-2115.
:pray: :pray:
> As for the bug report you mentioned, this is the expected outcome: when you enable smoothing, the lesson update won't happen at the exact time you first reach the threshold value. You can easily turn it off by setting `signal_smoothing: false`.
I cannot agree with that. In my case, my threshold is -0.7499 and the mean reward is -1, but it still satisfies the criterion. No one would expect this result from the curriculum. If I turn on `signal_smoothing`, then to avoid this bug I cannot set the threshold lower than -0.74, so I cannot create a curriculum whose threshold is -0.8 (with `signal_smoothing` on).
The worst thing is that this bug is hard to notice unless you run many experiments to find the boundary (0.75) and read the source code, and there is nothing about it in the docs. Someone who encounters it may never find the reason. I know I can easily avoid it by turning smoothing off, but people who don't know about this will run into the bug and not be able to figure out how to solve it. In particular, the example in the documentation uses `signal_smoothing: true`, which raises the probability that users encounter this bug, and since it is a complex bug that is hard to describe, it is hard for them to find this issue. So in my opinion, this should be fixed. Here are three possible solutions:
None initial
Initialize `smoothing` to `None`. When we want to use smoothing and it is still `None`, first set `smoothing = measure`. In this way, the initial value of `smoothing` is the same as the measure, which is reasonable since we have never seen the measure before and cannot assume it is 0 (see the sketch after this list).
Small initial
Initialize `smoothing` to something like -1e7. In this way, the bug cannot happen. It may take a few updates for `smoothing` to get close to the measure, but that is negligible.
Update every epoch
In the current source code, smoothing is updated only once `min_lesson_length` is satisfied. You could instead update smoothing every time the `need_increment` function is entered. This makes the value smoother, since it also takes the episodes before `min_lesson_length` into account. It cannot fix the bug by itself, so it has to be combined with Small initial or None initial, but it can reduce the side effects of the other two methods.
One more piece of advice about the curriculum
`completion_criteria` aggregates the measure over `min_lesson_length` episodes, so the training summary may be greater than the threshold while `completion_criteria` is still not satisfied. This can be confusing, and it is unclear to users why the lesson has not advanced yet. So I recommend that the summary also show the mean of the measure over the last `min_lesson_length` episodes, so users can clearly see the value actually used for `completion_criteria`. It could be a config option, enabled by default.
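For example, the value I would like to see in the summary is simply the mean of the measure over the most recent `min_lesson_length` episodes, i.e. the same quantity that is compared against the threshold. This is a hypothetical sketch, not existing ml-agents code:

```python
from collections import deque

min_lesson_length = 1000

# Keep only the most recent min_lesson_length episode rewards.
reward_buffer = deque(maxlen=min_lesson_length)

def completion_measure(buffer):
    """Mean episode reward over the buffered episodes, as used by completion_criteria."""
    return sum(buffer) / len(buffer) if buffer else 0.0

# Log this value next to the usual training summary so users can see how far
# the curriculum criterion actually is from its threshold.
```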
Maybe a bug
I find that if you first train for more than `min_lesson_length` episodes, then stop training and start again with the `--resume` option, the variable `reward_buffer` is reset to an empty list, so `lesson_length` becomes 0. That means you have to train for at least another `min_lesson_length` episodes to satisfy the `min_lesson_length` requirement again. I wonder whether that is intended.