active_osl
How can the LSTM learn from the request label?
1. How can the LSTM learn from the request label that is given at the next timestep?
- I also found something that could be wrong in the reward code:
a_t = tf.cast(a_t, tf.float32)
label_t = tf.cast(label_t, tf.float32)
rewards_t = label_t*params.reward_correct + (1-label_t)*params.reward_incorrect # (batch_size, num_labels), tf.float32
rewards_t = tf.pad(rewards_t, [[0,0],[0,1]], constant_values=params.reward_request) # (batch_size, num_labels+1), tf.float32
r_t = tf.reduce_sum(rewards_t*a_t, axis=1) # (batch_size), tf.float32
Does this code mean the agent gets a reward whether or not it requests the label, and that this reward guides the agent to make the right choice?
-
We have not done experiments to see how the LSTM is learning. The assumption is that it has learned to store a representation of the example in its hidden state and, on the subsequent step, to store the label for that example. Then, in future steps, it compares the new example with the stored representations of previous examples.
-
rewards_t stores the rewards for all possible actions (i.e. n_labels-1 entries of reward_incorrect, 1 entry of reward_correct, and 1 entry of reward_request). a_t is a one-hot vector for the chosen action. The multiplication selects out the reward for the chosen action.
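For anyone following along, here is a minimal, self-contained sketch of that selection step. The reward values below are illustrative only, not necessarily the ones in params:

import tensorflow as tf

# Illustrative reward values (the real ones live in params in the repo).
reward_correct, reward_incorrect, reward_request = 1.0, -1.0, -0.05

# One example with num_labels = 3, true class = 1.
label_t = tf.constant([[0., 1., 0.]])                                # (1, 3) one-hot true label
rewards_t = label_t * reward_correct + (1. - label_t) * reward_incorrect
rewards_t = tf.pad(rewards_t, [[0, 0], [0, 1]], constant_values=reward_request)
# rewards_t == [[-1., 1., -1., -0.05]] -> the reward each possible action would receive

# a_t is one-hot over num_labels + 1 actions; the last slot is "request label".
a_predict_correct = tf.constant([[0., 1., 0., 0.]])
a_request = tf.constant([[0., 0., 0., 1.]])

r_correct = tf.reduce_sum(rewards_t * a_predict_correct, axis=1)     # [1.0]
r_request = tf.reduce_sum(rewards_t * a_request, axis=1)             # [-0.05]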
But in my understanding, it seems there is a problem if every action has a direct reward. Starting from a random policy, the agent could learn to classify correctly from the correct/incorrect rewards alone, without ever requesting a label. Or do I misunderstand something?
Let me try to give an example of a possible learning path.
- The agent randomly (e-greedy) chooses the "request label" action on the first time step of an episode, and receives -0.05 reward
- The label is supplied on the next step
- Later in the episode, an example of the same class is provided as input
- The agent randomly (e-greedy) predicts the correct label, and receives +1
Back-propagation, by reducing the sum of the Q-function errors over the entire episode, makes the connection that requesting the label (incurring a small penalty) and storing it allows the agent to predict correctly later and receive a larger reward.
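To make the "sum of Q-function errors over the episode" concrete, here is a rough sketch of what such an episode-level loss might look like. This is not the repo's exact code; the discount, bootstrapping, and target-network details are assumptions:

import tensorflow as tf

def episode_q_loss(q_values, actions, rewards, discount=0.5):
    # q_values: (T, batch, num_actions) Q estimates from the LSTM at each step
    # actions:  (T, batch, num_actions) one-hot chosen actions
    # rewards:  (T, batch)              rewards received for those actions

    # Q-value of the action actually taken at each step.
    q_taken = tf.reduce_sum(q_values * actions, axis=-1)             # (T, batch)

    # One-step targets: r_t + discount * max_a Q(s_{t+1}, a),
    # with a zero bootstrap after the final step of the episode.
    next_max = tf.reduce_max(q_values[1:], axis=-1)                  # (T-1, batch)
    next_max = tf.concat([next_max, tf.zeros_like(next_max[:1])], axis=0)
    targets = rewards + discount * tf.stop_gradient(next_max)        # (T, batch)

    # Sum of squared TD errors over the whole episode; back-propagating
    # through this sum is what links "request now" to "predict correctly later".
    return tf.reduce_sum(tf.square(targets - q_taken))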
At a high level, understanding the supervised situation first might help in understanding my paper: https://arxiv.org/abs/1605.06065
But if it randomly chooses the correct classification action the first time, it gets a +1 reward, because every action has a direct reward. That seems like a supervised signal for unlabeled data whose label was not requested at that time.
The LSTM never sees the rewards. The rewards are only used in the loss, after all actions have been chosen for that episode.
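As I understand the setup, the per-step LSTM input contains the current example and, in place of a reward, the previous example's label only if it was requested. Here is a simplified sketch of that input construction (the function name and shapes are illustrative, not the repo's actual code); rewards never appear in this input, only in the episode loss above:

import tensorflow as tf

def build_lstm_input(x_t, prev_label, prev_requested):
    # x_t:            (batch, feature_dim)  current example
    # prev_label:     (batch, num_labels)   one-hot label of the previous example
    # prev_requested: (batch, 1)            1.0 if the label was requested last step, else 0.0

    # The previous label is only revealed if it was requested;
    # otherwise the agent sees a zero vector in its place.
    label_input = prev_label * prev_requested
    return tf.concat([x_t, label_input], axis=1)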