active_osl
How can the LSTM learn from the request label?
1. How can the LSTM learn from the request label that is given at the next timestep?
- I also found something that could be wrong in the reward code:
a_t = tf.cast(a_t, tf.float32)
label_t = tf.cast(label_t, tf.float32)
rewards_t = label_t*params.reward_correct + (1-label_t)*params.reward_incorrect # (batch_size, num_labels), tf.float32
rewards_t = tf.pad(rewards_t, [[0,0],[0,1]], constant_values=params.reward_request) # (batch_size, num_labels+1), tf.float32
r_t = tf.reduce_sum(rewards_t*a_t, axis=1) # (batch_size), tf.float32
Does this code mean the agent gets a reward whether or not it requests the label, and that this reward guides the agent to make the right choice?
-
We have not done experiments to see how the LSTM is learning. The assumption is that it has learned to store a representation of the example in its hidden state and, on the subsequent step, to store the label for that example. Then, in future steps, it compares the new example with the stored representations of previous examples.
-
rewards_t stores the rewards for all possible actions (i.e. n_labels-1 entries of reward_incorrect, 1 entry of reward_correct, and 1 entry of reward_request). a_t is a one-hot vector for the chosen action. The multiplication selects out the reward for the chosen action.
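For anyone following along, here is a minimal, self-contained sketch of that selection step. The reward values below are illustrative only, not necessarily the ones in params:

import tensorflow as tf

# Illustrative reward values (the real ones live in params in the repo).
reward_correct, reward_incorrect, reward_request = 1.0, -1.0, -0.05

# One example with num_labels = 3, true class = 1.
label_t = tf.constant([[0., 1., 0.]])                                # (1, 3) one-hot true label
rewards_t = label_t * reward_correct + (1. - label_t) * reward_incorrect
rewards_t = tf.pad(rewards_t, [[0, 0], [0, 1]], constant_values=reward_request)
# rewards_t == [[-1., 1., -1., -0.05]] -> the reward each possible action would receive

# a_t is one-hot over num_labels + 1 actions; the last slot is "request label".
a_predict_correct = tf.constant([[0., 1., 0., 0.]])
a_request = tf.constant([[0., 0., 0., 1.]])

r_correct = tf.reduce_sum(rewards_t * a_predict_correct, axis=1)     # [1.0]
r_request = tf.reduce_sum(rewards_t * a_request, axis=1)             # [-0.05]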
But in my understanding, it seems there is a problem if every action has a direct reward. Starting from a random policy, the agent could learn to classify correctly from the correct/incorrect rewards alone, without ever requesting a label. Or do I misunderstand something?
Let me try to give an example of a possible learning path.
- The agent randomly (e-greedy) chooses the "request label" action on the first time step of an episode, and receives -0.05 reward
- The label is supplied on the next step
- Later in the episode, an example of the same class is provided as input
- The agent randomly (e-greedy) predicts the correct label, and receives +1
Back-propagation, by reducing the sum of the Q-function errors over the entire episode, makes the connection that requesting the label (incurring a small penalty) and storing it allows the agent to predict correctly later and receive a larger reward.
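To make the "sum of Q-function errors over the episode" concrete, here is a rough sketch of what such an episode-level loss might look like. This is not the repo's exact code; the discount, bootstrapping, and target-network details are assumptions:

import tensorflow as tf

def episode_q_loss(q_values, actions, rewards, discount=0.5):
    # q_values: (T, batch, num_actions) Q estimates from the LSTM at each step
    # actions:  (T, batch, num_actions) one-hot chosen actions
    # rewards:  (T, batch)              rewards received for those actions

    # Q-value of the action actually taken at each step.
    q_taken = tf.reduce_sum(q_values * actions, axis=-1)             # (T, batch)

    # One-step targets: r_t + discount * max_a Q(s_{t+1}, a),
    # with a zero bootstrap after the final step of the episode.
    next_max = tf.reduce_max(q_values[1:], axis=-1)                  # (T-1, batch)
    next_max = tf.concat([next_max, tf.zeros_like(next_max[:1])], axis=0)
    targets = rewards + discount * tf.stop_gradient(next_max)        # (T, batch)

    # Sum of squared TD errors over the whole episode; back-propagating
    # through this sum is what links "request now" to "predict correctly later".
    return tf.reduce_sum(tf.square(targets - q_taken))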
At a high level, understanding the supervised situation first might help in understanding my paper: https://arxiv.org/abs/1605.06065
But if it randomly chooses the correct classification action the first time, it gets a +1 reward, because every action has a direct reward. That seems like a supervised signal for unlabeled data whose label was not requested at that time.
The LSTM never sees the rewards. The rewards are only used in the loss, after all actions have been chosen for that episode.
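As I understand the setup, the per-step LSTM input contains the current example and, in place of a reward, the previous example's label only if it was requested. Here is a simplified sketch of that input construction (the function name and shapes are illustrative, not the repo's actual code); rewards never appear in this input, only in the episode loss above:

import tensorflow as tf

def build_lstm_input(x_t, prev_label, prev_requested):
    # x_t:            (batch, feature_dim)  current example
    # prev_label:     (batch, num_labels)   one-hot label of the previous example
    # prev_requested: (batch, 1)            1.0 if the label was requested last step, else 0.0

    # The previous label is only revealed if it was requested;
    # otherwise the agent sees a zero vector in its place.
    label_input = prev_label * prev_requested
    return tf.concat([x_t, label_input], axis=1)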