
Ego-Attention Paper


Hi eleurent,

I tried using ego attention as you suggested, but I have some doubts about the ego-attention paper and implementation:

  1. So the output from the encoder is fed as input into Lk, Lv, Lq, and their outputs are the values q, k, v. The paper says that q is the query, the k_i are descriptive features, and the v_i are also features, all computed by linear projections Lk, Lv, Lq. Could you please explain in a little more detail what the inputs and outputs of the Lk, Lv, Lq layers are, and what exactly q, k and v mean? I tried to find them in the code implementation but could not understand it. If they are based on features, how is it decided which feature belongs to, for example, k or v? Also, how many Lk, Lv, Lq blocks are there in the architecture? As it looks in Figure 3, three blocks (Lk, Lv, Lq) are attached to one embedding layer's output and two blocks (Lk, Lv) to the other embedding layers' outputs. Is that right?

  2. The paper says that the green head only watches vehicles coming from the left, and the blue head watches the front and right. Is this still the case in highway_env? What if I set see_behind to true, which head would then be responsible? In other words, how can I decide which head watches which direction? And how are these colors decided?

skynox03 — May 04 '21 08:05

Hi @skynox03,

Question 1.

The dot-product self-attention mechanism reproduces the behavior of a key-value store.

For each vehicle, we start with an embedding h_i, and produce:

  • a key k_i, which represents some features of the vehicle that we want to match
  • a value v_i, which represents some other features of the vehicle that we want to propagate deeper in the network, but only for the matched vehicles

Then, the ego-vehicle emits a query q, and we retrieve the vehicles whose key k_i best matches this query, by means of a dot product. This provides us with weights used to sum the vehicles' values and obtain the final output.
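
For concreteness, here is a minimal sketch of this mechanism in PyTorch; the sizes and variable names are illustrative, not the exact ones from the implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: N vehicles with d-dimensional embeddings.
N, d = 5, 64
h = torch.randn(N, d)                # embeddings h_i; by convention h[0] is the ego-vehicle
L_q = torch.nn.Linear(d, d, bias=False)
L_k = torch.nn.Linear(d, d, bias=False)
L_v = torch.nn.Linear(d, d, bias=False)

q = L_q(h[0:1])                      # query emitted by the ego-vehicle, shape (1, d)
k = L_k(h)                           # keys k_i, one per vehicle, shape (N, d)
v = L_v(h)                           # values v_i, one per vehicle, shape (N, d)

scores = q @ k.T / d ** 0.5          # dot-product matching of q against every k_i, shape (1, N)
weights = F.softmax(scores, dim=-1)  # attention weights over the vehicles
output = weights @ v                 # weighted sum of the vehicles' values, shape (1, d)
```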

Of course, since all these embeddings are learnt, we cannot control for sure what these features mean, whether they should belong to k or v, etc. The training is simply tasked with making them converge to something useful for minimizing the loss.

As an illustration, we could imagine that the network may converge to the following features:

  • the key k_i of a vehicle contains e.g. its predicted position in 2 seconds
  • the query q of the ego-vehicle also contains its own predicted position in 2 seconds
  • the value v_i of a vehicle contains some useful properties for subsequent decisions, like its heading angle, speed, past behavior, presence of an ML researcher among the passengers, etc.

That way, the ego-vehicle is able to select, out of all vehicles in the scene, only those that are likely to collide with it in 2 seconds (their predicted position is close to its own), and compute their aggregated features.

And yes, I used separate weights L_k, L_v for the ego-vehicle and the other vehicles, to allow for more flexibility, but I'm not sure it makes a difference; I haven't actually benchmarked it.
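
Continuing the sketch above, that split could look like this (again, illustrative names, not the repository's exact code):

```python
# Illustrative variant with separate projection weights for the ego-vehicle
# and for the other vehicles, reusing h, d and N from the sketch above.
L_k_ego = torch.nn.Linear(d, d, bias=False)
L_k_others = torch.nn.Linear(d, d, bias=False)
L_v_ego = torch.nn.Linear(d, d, bias=False)
L_v_others = torch.nn.Linear(d, d, bias=False)

k = torch.cat([L_k_ego(h[0:1]), L_k_others(h[1:])], dim=0)  # keys, shape (N, d)
v = torch.cat([L_v_ego(h[0:1]), L_v_others(h[1:])], dim=0)  # values, shape (N, d)
# q, scores, weights and output are then computed exactly as before.
```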

Question 2.

The green and blue heads are simply the first and second heads of the self-attention layer. Each head implements the mechanism described above, so the heads can learn different features to look for in the key-value matching: e.g. one head may look for vehicles headed east, another for vehicles driving very fast, etc. But again, these features emerge through training. I did observe a directional specialization in a few training runs, but by changing the environment, or the number of heads, or simply by following a different training trajectory, we may end up with different functions implemented by each head. So no, you cannot really decide which head looks at which direction; you can only try to interpret what these layers actually do after training, by visualizing their attention scores.
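
To make the multi-head part concrete, here is how the sketch above extends to several heads; the per-head weights are exactly what gets visualized as attention scores. As before, this is an illustrative sketch rather than the exact implementation:

```python
# Illustrative multi-head version: split the d-dimensional projections into
# `heads` chunks and run the dot-product attention independently in each head.
heads = 2
d_h = d // heads
q_h = L_q(h[0:1]).view(1, heads, d_h).transpose(0, 1)   # (heads, 1, d_h)
k_h = L_k(h).view(N, heads, d_h).transpose(0, 1)        # (heads, N, d_h)
v_h = L_v(h).view(N, heads, d_h).transpose(0, 1)        # (heads, N, d_h)

scores = q_h @ k_h.transpose(1, 2) / d_h ** 0.5         # (heads, 1, N)
weights = F.softmax(scores, dim=-1)                     # one attention distribution per head
output = (weights @ v_h).transpose(0, 1).reshape(1, d)  # concatenated head outputs

# weights[i, 0] is head i's attention over the N vehicles: these are the
# scores you can plot after training to interpret what each head attends to.
```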

eleurent — May 04 '21 09:05

Thank you for the reply, I have a much clearer picture now :)

skynox03 — May 04 '21 11:05