
Use curriculum and agent group at the same time

Open nathan60107 opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe.

I made an asymmetric game with one ghost and one human: the ghost tries to catch the human, and the human needs to run away. The curriculum worked well at that point. Now I have changed the number of humans to 2, and they need to work together to solve some missions, so I register them as a group. But I find that I cannot set the group reward as the measure for completion_criteria in the curriculum, and it seems nobody has asked this question before. Is there any way to use the group reward as the measure now? If not, I have to keep using the agent reward as the measure, or control the curriculum myself.

Describe the solution you'd like

Method 1

Let users choose group_reward as the measure. Since groups are not in the config file, we can only choose a behavior as the target of the measure. An example is below:

- name: MyFirstLesson
  completion_criteria:
    measure: group_reward # a new measure
    behavior: Human
    signal_smoothing: true
    min_lesson_length: 100
    threshold: 0.2
  value: 0.0

But there is a big problem with this method. Human agents can belong to different groups at the same time. For example, Human 1 and Human 2 join group A, Human 3 and Human 4 join group B, and Ghost 1 joins group C. In this situation, the config above would take the mean group reward over all of Humans 1-4 as the measure, rather than only the Humans in group A, and there is still no way to select only group A as the measure target. On the other hand, this is exactly how TensorBoard shows group rewards (all Human agents share the mean group reward), so it might still be an acceptable way to measure group reward.

Describe alternatives you've considered

Method 2

We can improve the functionality of the curriculum. For example, allow C# code to act as a criterion. If we have YAML like:

environment_parameters:
  my_envpara:
    curriculum:
      - name: MyFirstLesson
...

and users call Academy.Instance.CompletionCriteria("my_envpara"), then my_envpara will move to the next lesson. This allows users to define curricula more flexibly. But it also requires exposing the mean reward of a behavior and the mean reward of an agent group to C# code, which might be a lot of work. That way, users can get the mean reward of group A themselves, and if it is high enough, call CompletionCriteria.

Additional context

I know that we usually want the reward of agents to go higher. But in some situations we may need the reward to go lower. For example:

environment_parameters:
  my_envpara:
    curriculum:
      - name: MyLesson1
      - name: MyLesson2
      - name: MyLesson3
...

In Lesson1 I let Ghost learn to catch Humans, so the mean reward of Ghost goes up. When the mean reward of Ghost is high enough, training moves to Lesson2. In Lesson2 I change some parts of the environment to let Human learn to run away. The criterion for Lesson2 could be the mean reward of Ghost getting low enough, or the mean reward of Human getting high enough. But there may be situations where I cannot use the reward of Human as the criterion. What I want to express is that curriculum criteria need more flexibility, so users can build more complex curricula.

A bug report

There is also a bug in the curriculum. When I set the YAML as:

- name: MyFirstLesson
  completion_criteria:
    measure: reward
    behavior: Ghost
    signal_smoothing: true
    min_lesson_length: 100
    threshold: -0.75
  value: 0.0

and I fix the reward of Ghost at -1, the criterion is satisfied and training moves to the next lesson. But when I set the threshold to -0.74, it does not. For some reason, a reward α can satisfy a threshold of 0.75α, and only when α is negative. When I fix the mean reward at 0, a threshold of 0 is not satisfied but a threshold of -0.99 is. So there is no problem when the mean reward is >= 0, but when the mean reward is < 0, the criterion can be completed even though the reward is lower than the threshold. Is that a bug, or am I doing something wrong?

Here are two real examples:

case1:

course_stage:
    curriculum:
    - value:
        sampler_type: constant
        sampler_parameters:
          seed: 5277
          value: 0.0
      name: Lesson0 (Ghost learn to catch human, freeze human)
      completion_criteria:
        behavior: Ghost
        measure: reward
        min_lesson_length: 1000
        signal_smoothing: true
        threshold: -0.7499 # <- difference
        require_reset: false

(Screenshot) Almost all of the Ghost's mean rewards are -1, and the threshold is -0.7499, but the curriculum moves to the next lesson at step 20k, which is when min_lesson_length is first reached.

case2:

curriculum:
    - value:
        sampler_type: constant
        sampler_parameters:
          seed: 4928
          value: 0.0
      name: Lesson0 (Ghost learn to catch human, freeze human)
      completion_criteria:
        behavior: Ghost
        measure: reward
        min_lesson_length: 1000
        signal_smoothing: true
        threshold: -0.74 # <- difference
        require_reset: false

(Screenshot) All of the Ghost's mean rewards are -1 or slightly higher, and the threshold is -0.74, but it does not move to the next lesson until step 300k. This bug means I cannot build the curriculum I actually want, which would let Ghost train until its mean reward rises above -0.8 before moving to the next lesson.

It is because of the code here. The initial value of smoothing is 0 and the initial value of my measure is -1, so 0.25*0 + 0.75*(-1) = -0.75: the smoothed value is larger than the real measure, so the curriculum moves to the next lesson too early. If the initial value of my measure were 1, then 0.25*0 + 0.75*1 = 0.75: the smoothed value is smaller than the real measure, so there is no bug. That is why only negative values trigger this bug. I think there is no need for smoothing at all, since the measure is already a mean value, so you could just remove the smoothing option or change the way its initial value is chosen.
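Here is a minimal sketch of the smoothed check, using the coefficients from the arithmetic above (the function name and structure are mine, not the exact trainer source):

# Minimal sketch of the smoothed completion check; illustrative only,
# not the actual ml-agents code.
def smoothed_check(mean_reward, threshold, smoothing=0.0):
    # smoothing starts at 0.0 regardless of what the real measure is
    smoothing = 0.25 * smoothing + 0.75 * mean_reward
    return smoothing > threshold, smoothing

# With the mean reward fixed at -1, the first smoothed value is -0.75, so any
# threshold below -0.75 is crossed on the very first check, even though the raw
# reward never rises above -1.
print(smoothed_check(-1.0, -0.76))  # (True, -0.75)  -> lesson advances too early
print(smoothed_check(-1.0, -0.74))  # (False, -0.75) -> must wait for the real reward to improve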

nathan60107 commented on Jul 30 '21

Thanks for raising this feature request. This is something on our roadmap, since we have also received requests for ELO as a completion criterion previously. We have been rethinking the interface design, and the goal is to provide a more flexible interface to specify completion criteria with any training stat. I will update here when we have the feature ready. Logged internally as MLA-2115.

As for the bug report you mentioned, this is the expected outcome: when you enable smoothing, the lesson update won't happen at the exact moment you first reach the threshold value. You can easily turn it off by setting signal_smoothing: false.

dongruoping commented on Aug 03 '21

Thanks for raising this feature request. This is something on our roadmap, since we have also received requests for ELO as a completion criterion previously. We have been rethinking the interface design, and the goal is to provide a more flexible interface to specify completion criteria with any training stat. I will update here when we have the feature ready. Logged internally as MLA-2115.

:pray: :pray:

As for the bug report you mentioned, this is the expected outcome: when you enable smoothing, the lesson update won't happen at the exact moment you first reach the threshold value. You can easily turn it off by setting signal_smoothing: false.

I cannot agree with that. In my case, the threshold is -0.7499 and the mean reward is -1, but it still satisfies the criterion. No one would expect this result from the curriculum. With signal_smoothing on, to avoid this bug I cannot set the threshold lower than -0.74, so I cannot create a curriculum whose threshold is -0.8 (if signal_smoothing is on).

The worst part is that this bug is hard to notice unless you run many experiments to find the boundary (0.75x) and read the source code, and there is nothing about it in the docs. Someone who runs into it may never find the reason. I know I can avoid it easily by turning smoothing off, but people who don't know about this will hit the bug and not know how to solve it. Especially since the example in the documentation uses signal_smoothing: true, which makes it more likely that users will encounter it, and since it is a complex bug that is hard to describe, they will have a hard time finding this issue. So in my opinion this should be fixed. Here are three possible solutions:

None initial

Set the initial value of smoothing to None. When smoothing is about to be used and it is still None, first set smoothing = measure. This way the initial smoothing value equals the measure, which is reasonable: we have never seen the measure before, so we cannot assume it is 0.
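A minimal sketch of this fix (hypothetical code, not the actual ml-agents implementation):

# Sketch of the "None initial" fix: seed the running average with the first
# measured value instead of 0.0. Names are illustrative, not the real source.
def smoothed_check(mean_reward, threshold, smoothing=None):
    if smoothing is None:
        # First time we see the measure: start the average at the measure itself.
        smoothing = mean_reward
    smoothing = 0.25 * smoothing + 0.75 * mean_reward
    return smoothing > threshold, smoothing

# With the mean reward fixed at -1, the smoothed value stays at -1,
# so a threshold of -0.8 is no longer crossed prematurely.
print(smoothed_check(-1.0, -0.8))  # (False, -1.0)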

Small initial

Set the initial value of smoothing to something very small, like -1e7. This way the bug cannot happen. It may take a few updates for smoothing to get close to the measure, but that is negligible.

Update every epoch

In the current source code, smoothing is only updated once min_lesson_length is satisfied. You could instead update smoothing every time the need_increment function is entered. This would make the value smoother, since it also takes the episodes before min_lesson_length into account. It cannot fix the bug on its own, so it has to be combined with "Small initial" or "None initial", but it reduces the downside of those two methods (see the sketch below).
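A sketch combining "Update every epoch" with the None-initial idea (hypothetical structure and signature; the real need_increment differs):

# Update the running average on every call, even before min_lesson_length
# episodes have been collected, and seed it with the first measured value.
def need_increment(reward_buffer, min_lesson_length, threshold, smoothing=None):
    if not reward_buffer:
        return False, smoothing
    measure = sum(reward_buffer) / len(reward_buffer)
    smoothing = measure if smoothing is None else 0.25 * smoothing + 0.75 * measure
    # Only allow the lesson to advance once enough episodes have been seen,
    # but keep the smoothed value warm in the meantime.
    if len(reward_buffer) < min_lesson_length:
        return False, smoothing
    return smoothing > threshold, smoothing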


One more piece of advice about the curriculum

completion_criteria averages the measure over min_lesson_length episodes, so the reward shown in the training summary may already be greater than the threshold while completion_criteria is not yet satisfied. That can be confusing for users waiting for the lesson to increase. So I recommend that the summary also show the mean of the measure over the last min_lesson_length episodes, so users can clearly see the value actually used for completion_criteria. It could be a config option that is on by default.
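A toy illustration of the mismatch, as I understand the summary statistics (all numbers and window sizes here are hypothetical):

# The console/TensorBoard summary averages over its recent reporting window,
# while completion_criteria averages over min_lesson_length episodes, so the
# two numbers can disagree for a while after the reward improves.
recent_episodes = [0.9, 0.85, 0.95]                 # what a recent summary might show
reward_buffer = [-1.0] * 800 + [0.9] * 200          # last min_lesson_length = 1000 episodes
print(sum(recent_episodes) / len(recent_episodes))  # 0.9   -> looks above a 0.5 threshold
print(sum(reward_buffer) / len(reward_buffer))      # -0.62 -> criterion not yet satisfied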


Maybe a bug

I find that if you first train for more than min_lesson_length episodes, stop training, and then start again with the --resume option, the reward_buffer variable is reset to an empty list, so lesson_length goes back to 0. That means you have to train for at least another min_lesson_length episodes to satisfy the min_lesson_length requirement. I wonder whether that is intended.

nathan60107 commented on Aug 03 '21