
Categorical feature embedding implementation

Open inoryy opened this issue 6 years ago • 4 comments

In the SC2LE paper, there's this sentence under input pre-processing:

We embed all feature layers containing categorical values into a continuous space which is equivalent to using a one-hot encoding in the channel dimension followed by a 1 × 1 convolution.

This raises two questions on implementation details. Let's assume we're dealing with a 64x64 minimap and we want to embed the visibility_map (4 levels) and player_relative (5 levels) features.

1.) What is the embedding dimension? That is, what is the number of kernels used in the 1x1 conv? E.g. if it's 1, then our final (concatenated) output dimensions would be 64x64x2.

2.) Is the embedding done separately for each feature or in a single pass over all of them? More specifically:

    i. one-hot on channel -> concat on channel -> 1x1 conv on all features at the same time.
       ex. one-hot to 64x64x4 and 64x64x5 -> concat to 64x64x9 -> 1x1(x2) conv to 64x64x2 output
    ii. one-hot on channel -> 1x1 conv separately per feature -> concat on channel.
        ex. one-hot to 64x64x4 and 64x64x5 -> 1x1(x1) conv to 64x64x1 and 64x64x1 -> concat to 64x64x2 output

The big difference between the two is that in the first case all features influence the output channels at the same time.
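
To make the two options concrete, here is a minimal sketch of both on the example above. This is PyTorch with made-up random weights; the paper specifies neither the framework nor any of these names:

```python
import torch
import torch.nn.functional as F

# Two minimap feature layers: visibility_map (4 levels), player_relative (5 levels)
vis = torch.randint(0, 4, (1, 64, 64))  # (N, H, W) integer-coded feature layer
rel = torch.randint(0, 5, (1, 64, 64))

# One-hot on the channel dimension (NCHW layout)
vis_oh = F.one_hot(vis, 4).permute(0, 3, 1, 2).float()  # (1, 4, 64, 64)
rel_oh = F.one_hot(rel, 5).permute(0, 3, 1, 2).float()  # (1, 5, 64, 64)

# Variant i.: concat one-hots, then a single 1x1 conv over all features at once
w_joint = torch.randn(2, 9, 1, 1)  # 2 output channels, 9 = 4 + 5 input channels
out_i = F.conv2d(torch.cat([vis_oh, rel_oh], dim=1), w_joint)  # (1, 2, 64, 64)

# Variant ii.: a separate 1x1 conv per feature, then concat the results
w_vis, w_rel = torch.randn(1, 4, 1, 1), torch.randn(1, 5, 1, 1)
out_ii = torch.cat([F.conv2d(vis_oh, w_vis),
                    F.conv2d(rel_oh, w_rel)], dim=1)  # (1, 2, 64, 64)
```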

inoryy avatar Nov 07 '17 15:11 inoryy

I agree that the "embedding" statement in the paper can mean two different things.

Here is a gist https://gist.github.com/pekaalto/1549f5dd3d43dc55de2c0d91e857164e which discusses methods i. and ii. from question 2.) above in more detail.

As seen in the gist (and claimed in the paper), both of these can be represented as embedding look-ups. The difference is that in

  • i. we sum the embeddings on the channel axis
  • ii. we concatenate the embeddings on the channel axis.

It is not obvious which one was used.
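
For illustration, here is a PyTorch sketch (my own, along the lines of the gist) of the lookup view: a 1x1 conv on a one-hot input just selects one row of its weight matrix per pixel, so each variant reduces to indexing per-feature embedding tables:

```python
import torch

vis = torch.randint(0, 4, (1, 64, 64))
rel = torch.randint(0, 5, (1, 64, 64))

# Variant i. as lookups: each feature gets a table into the full 2-d output
# space, and the per-feature embeddings are summed on the channel axis
# (this matches one-hot -> concat -> joint 1x1 conv, ignoring the bias).
emb_vis, emb_rel = torch.randn(4, 2), torch.randn(5, 2)
out_i = emb_vis[vis].permute(0, 3, 1, 2) + emb_rel[rel].permute(0, 3, 1, 2)

# Variant ii. as lookups: each feature gets its own 1-d table, and the
# per-feature embeddings are concatenated on the channel axis.
emb_vis1, emb_rel1 = torch.randn(4, 1), torch.randn(5, 1)
out_ii = torch.cat([emb_vis1[vis].permute(0, 3, 1, 2),
                    emb_rel1[rel].permute(0, 3, 1, 2)], dim=1)  # (1, 2, 64, 64)
```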

pekaalto avatar Nov 07 '17 15:11 pekaalto

@Inoryy

  1. My suggestion would be to use log2(categories) channels (or a log with some other base) for each embedding, as e.g. unit_type has 1850 categories and 1 output channel might not be enough to properly represent this feature. Using a number of channels that is logarithmic in the number of categories should be a safe choice (see the sketch after this list).

  2. I think it makes more sense to do it separately per feature, as this saves some computation when done with 1x1 convs and also fits better with thinking of it as an embedding for each feature layer.
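
Here is a sketch of both suggestions combined: per-feature embeddings with roughly log2-many channels, concatenated on the channel axis. This is PyTorch; the embed_features helper and the dict layout are mine, and the category counts are the ones mentioned in this thread:

```python
import math
import torch
import torch.nn as nn

# Category counts per feature layer (counts as discussed in this thread)
scales = {"visibility_map": 4, "player_relative": 5, "unit_type": 1850}

# One embedding table per feature, with ceil(log2(categories)) channels each
embeds = nn.ModuleDict({
    name: nn.Embedding(n, max(1, math.ceil(math.log2(n))))
    for name, n in scales.items()
})

def embed_features(layers):
    """layers: dict of (N, H, W) integer tensors keyed by feature name."""
    outs = [embeds[name](x).permute(0, 3, 1, 2) for name, x in layers.items()]
    return torch.cat(outs, dim=1)  # concat per-feature embeddings on channels

layers = {n: torch.randint(0, c, (1, 64, 64)) for n, c in scales.items()}
print(embed_features(layers).shape)  # torch.Size([1, 16, 64, 64]): 2 + 3 + 11
```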

@pekaalto I guess concatenating makes more sense if we assume a flexible number of embedding channels for different feature layers, and it also seems to be the more standard way of combining convnet inputs.

simonmeister avatar Dec 31 '17 19:12 simonmeister

@simonmeister Why do you claim that 2. ii) saves computational time over 2. i)?
If the output dimension is kept fixed, I don't see any meaningful difference in performance, so I think performance is somewhat irrelevant here.

About concatenating and summing.
I tried to give a different but equivalent explanation of methods 2) i. and ii. from the OP. Thinking in terms of embedding lookups is easier than in terms of 1x1 convs, and it makes the differences between the methods more obvious, at least for me.

I agree that the embedding dimension should vary between features here, and that 2. ii) was probably used, but this is just speculation (until the paper is clarified). I don't think 2. ii) is so much more standard than 2. i) that it goes without saying.

pekaalto avatar Jan 01 '18 05:01 pekaalto

@pekaalto Regarding the computational efficiency, I meant it in the following way: let's assume we have 2 feature layers and we want a specific total output dimensionality (e.g. 10). Also, let's assume that feature layer 1 has 30 channels when seen as one-hot and that feature layer 2 has 10. If we use two 1x1 convs with e.g. 7 and 3 output channels (and 30 and 10 input channels respectively), we have 7x30 + 3x10 = 240 weights. If we use one 1x1 conv with 10 output and 40 input channels, we have 10x40 = 400 weights. So the former seems more efficient, assuming that the chosen number of output channels per feature layer suffices to represent its features. Also, embedding a one-hot vector into a continuous space is, in my opinion, not equivalent to concatenating multiple one-hot tensors and embedding the concatenated tensor.
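
A quick PyTorch check of the weight counts above (bias terms omitted, matching the arithmetic):

```python
import torch.nn as nn

# Variant ii.: separate 1x1 convs, 30 -> 7 and 10 -> 3 channels
sep = [nn.Conv2d(30, 7, 1, bias=False), nn.Conv2d(10, 3, 1, bias=False)]
print(sum(p.numel() for m in sep for p in m.parameters()))  # 7*30 + 3*10 = 240

# Variant i.: one joint 1x1 conv, 40 -> 10 channels
joint = nn.Conv2d(40, 10, 1, bias=False)
print(sum(p.numel() for p in joint.parameters()))  # 10*40 = 400
```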

Well, I guess everything is just speculation until clarified, but it makes little sense to me to assume that the output dimensions are the same for all feature layers (independent of which method we use in 1.), as e.g. unit_type should have a richer representation than visibility_map, and it would be very wasteful to use more channels for a given feature than required.

Also, I disagree that summing different inputs is a reasonable way of combining multiple heterogeneous input channels, as we lose a lot of information. For example, if we have an input image with RGB channels and we sum them, we can't tell whether a certain pixel was (0, 0, 255) or (255, 0, 0), which can be rather important.

simonmeister avatar Jan 01 '18 10:01 simonmeister