
How can the LSTM learn from the requested label?

Open flazerain opened this issue 6 years ago • 5 comments

1. How can the LSTM learn from the requested label, which is only given at the next timestep?

2. I found something that could be wrong.

The reward code:

```python
a_t = tf.cast(a_t, tf.float32)
label_t = tf.cast(label_t, tf.float32)
rewards_t = label_t * params.reward_correct + (1 - label_t) * params.reward_incorrect  # (batch_size, num_labels), tf.float32
rewards_t = tf.pad(rewards_t, [[0, 0], [0, 1]], constant_values=params.reward_request)  # (batch_size, num_labels+1), tf.float32

r_t = tf.reduce_sum(rewards_t * a_t, axis=1)  # (batch_size), tf.float32
```

Does this code mean that the agent gets a reward whether or not it requests the label, and that this reward guides the agent toward the correct choice?

flazerain avatar Jul 05 '18 07:07 flazerain

  1. We have not done experiments to see how the LSTM is learning. The assumption is that it learns to store a representation of the example in its hidden state and, on the subsequent step, to store the label for that example. Then, in future steps, it compares the new example against the stored representations of previous examples.

  2. rewards_t stores the rewards for all possible actions (i.e. num_labels - 1 entries of reward_incorrect, one reward_correct, and one reward_request). a_t is a one-hot vector for the chosen action, so the multiplication and sum select out the reward for the chosen action, as in the sketch below.
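
A minimal NumPy sketch of that selection (the reward values, class count, and batch size are made up for illustration, not taken from the repo's params):

```python
import numpy as np

# illustrative values standing in for params.reward_correct / reward_incorrect / reward_request
reward_correct, reward_incorrect, reward_request = 1.0, -1.0, -0.05
num_labels, batch_size = 3, 2

label_t = np.array([[0, 1, 0],            # true class of example 1
                    [1, 0, 0]], float)    # true class of example 2

# reward for predicting each class: reward_correct at the true class, reward_incorrect elsewhere
rewards_t = label_t * reward_correct + (1 - label_t) * reward_incorrect   # (batch_size, num_labels)
# append one more column: the reward for the "request label" action
rewards_t = np.concatenate([rewards_t, np.full((batch_size, 1), reward_request)], axis=1)

# one-hot chosen actions: example 1 predicts class 1 (correct), example 2 requests the label
a_t = np.array([[0, 1, 0, 0],
                [0, 0, 0, 1]], float)

r_t = (rewards_t * a_t).sum(axis=1)
print(r_t)   # [ 1.   -0.05]
```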

markpwoodward avatar Jul 06 '18 17:07 markpwoodward

But in my understanding, there seems to be a problem if every action has a direct reward. Starting from a random policy, the agent could learn to make correct decisions from the correct/incorrect rewards alone, without ever requesting a label. Or am I misunderstanding something?

flazerain avatar Jul 06 '18 17:07 flazerain

Let me try to give an example of a possible learning path.

  1. The agent randomly (e-greedy) chooses the "request label" action on the first time step of an episode and receives a -0.05 reward
  2. The label is supplied on the next step
  3. Later in the episode, an example of the same class is provided as input
  4. The agent randomly (e-greedy) predicts the correct label and receives +1

Back-propagation, in reducing the sum of the Q-function errors over the entire episode, makes the connection that requesting the label (incurring a small penalty) and storing it allows the agent to predict correctly later and receive a larger reward; a back-of-the-envelope version of that trade-off is sketched below.
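
A rough sketch of why that trade-off favors requesting (the reward values, discount factor, and class count are illustrative, not the repo's settings):

```python
# illustrative numbers only
reward_correct, reward_incorrect, reward_request = 1.0, -1.0, -0.05
gamma = 0.9        # discount factor (assumed here for illustration)
num_classes = 3

# Path A: request the label at t=0, store it, then predict correctly at t=3
return_request = reward_request + gamma**3 * reward_correct        # -0.05 + 0.729 = 0.679

# Path B: guess blindly at t=0 and again at t=3 (1/num_classes chance of being right each time)
p = 1.0 / num_classes
expected_guess = p * reward_correct + (1 - p) * reward_incorrect   # -1/3 per guess
return_guess = expected_guess + gamma**3 * expected_guess          # about -0.58

print(return_request, return_guess)   # requesting gives the higher expected return
```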

At a high level, understanding the supervised setting first (https://arxiv.org/abs/1605.06065) might help in understanding my paper.


markpwoodward avatar Jul 06 '18 17:07 markpwoodward

But if it randomly chooses the correct classification action the first time it sees an example, it gets a +1 reward, because every action has a direct reward. That seems like a supervised signal for unlabeled data whose label was not requested at that step.

flazerain avatar Jul 06 '18 18:07 flazerain

The LSTM never sees the rewards. The rewards are only used in the loss, after all actions have been chosen for that episode; a sketch of that separation is below.
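
A self-contained toy sketch of that separation (the network, episode, and reward values are all stand-ins, not the repo's code): the LSTM input at each step is the example plus the previously requested label (or zeros), and the rewards only appear in the loss computed after the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels = 3
reward_correct, reward_incorrect, reward_request = 1.0, -1.0, -0.05   # illustrative values

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def lstm_step(hidden, x_t, prev_label_onehot):
    """Stand-in for the recurrent core. Its input is the example + previous label, never a reward."""
    hidden = np.tanh(hidden + x_t.sum() + prev_label_onehot.sum())    # toy state update
    q_t = rng.normal(size=num_labels + 1)                             # toy Q-values over num_labels + 1 actions
    return hidden, q_t

# toy episode of (example, true label) pairs
episode = [(rng.normal(size=4), int(rng.integers(num_labels))) for _ in range(5)]

hidden = 0.0
prev_label = np.zeros(num_labels)        # zeros unless the label was requested on the previous step
chosen_qs, rewards = [], []

for x_t, y_t in episode:
    hidden, q_t = lstm_step(hidden, x_t, prev_label)
    a_t = int(q_t.argmax())              # greedy here; the repo uses e-greedy exploration
    if a_t == num_labels:                # the extra action is "request label"
        r_t = reward_request
        prev_label = one_hot(y_t, num_labels)          # the label is fed in on the NEXT step
    else:
        r_t = reward_correct if a_t == y_t else reward_incorrect
        prev_label = np.zeros(num_labels)
    chosen_qs.append(q_t[a_t])
    rewards.append(r_t)

# Only here, after the episode, do the rewards enter: a Q-learning-style squared error
# (simplified to 1-step targets with no bootstrapping; the actual loss in the repo differs).
loss = np.mean((np.array(chosen_qs) - np.array(rewards)) ** 2)
print(loss)
```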


markpwoodward avatar Jul 06 '18 20:07 markpwoodward