handson-ml3
[QUESTION]
I'm a bit confused about the "Encoding Categorical Features Using Embeddings" section in Chapter 13.
There you provided an example about encoding categorical data, and said that 50,000 one-hot encoded categories is equivalent to around 100 embedding dimensions... But the sample you provided isn't clear enough to understand the comparison.
The code you provided just outputs a Tensor of shape=(3, 2) for the categories, and what else? (I expected a complete Model trained on some real training data, with compile and fit, so we can see the difference.)
Can you provide a link to a clearer comparison between one-hot encoding and Embeddings? Where should I use one or the other?
Thanks for sharing your knowledge with the 3rd version of the book, very impressive!
Hi @marcelmenezes,
Thanks for your question.
Suppose you want to encode a categorical feature, such as a user's country of origin. Suppose there are just 4 possible countries. The most naive approach would be to just encode them as a single integer value: the first country would be 1, the second would be 2, and so on. The problem with this approach is that most types of models (including neural nets, linear models, random forests, and so on) implicitly assume that similar values represent similar things: therefore, by encoding the countries using this naive ordinal approach, we would bias the model towards "thinking" that countries 1 and 2 are more similar than countries 1 and 4.
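To see the problem concretely, here is a tiny sketch (the country names and integer codes below are made up purely for illustration):

# Hypothetical countries with a naive ordinal encoding.
countries = ["France", "Germany", "Japan", "Brazil"]
country_to_id = {country: i + 1 for i, country in enumerate(countries)}
# {'France': 1, 'Germany': 2, 'Japan': 3, 'Brazil': 4}

# The encoding implies distances that mean nothing: |1 - 2| = 1 but |1 - 4| = 3,
# so the model is nudged towards treating France and Germany as more similar
# than France and Brazil.
print(abs(country_to_id["France"] - country_to_id["Germany"]))  # 1
print(abs(country_to_id["France"] - country_to_id["Brazil"]))   # 3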
To avoid biasing the model this way, you could instead use one-hot encoding. Country one would be represented as the 4-dimensional vector [1, 0, 0, 0], country two would be represented as [0, 1, 0, 0], country three as [0, 0, 1, 0], and country four as [0, 0, 0, 1]. Since all of these vectors are the same distance from one-another, we're not introducing any bias. Great! However, this approach only works when there's a limited number of categories. If there were 200 possible countries instead of just 4, we would have to represent each country using a 200-dimensional vector. This would be inefficient and it would require many more parameters in our models. For example, the first layer of a dense neural network requires one parameter per input and per neuron (plus one bias term per neuron, but I'm ignoring those here), so if the first layer in the neural net had 100 neurons, we would have 100*200 = 20,000 parameters just to handle the country input! Not great. The more parameters, the more training data you need, and the longer training will take. We need a better approach.
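To make the parameter count concrete, here's a minimal sketch (not from the book; the 200-country setup and the 100-neuron layer are just the numbers from the paragraph above):

import tensorflow as tf

n_countries = 200  # hypothetical number of categories

# One-hot encode a few country indices: each becomes a 200-dimensional vector.
country_ids = tf.constant([0, 1, 199])
one_hot = tf.one_hot(country_ids, depth=n_countries)
print(one_hot.shape)  # (3, 200)

# Feeding these vectors into a Dense layer with 100 neurons requires
# 200 * 100 = 20,000 weights (plus 100 bias terms).
dense = tf.keras.layers.Dense(100)
dense.build((None, n_countries))
print(dense.count_params())  # 20100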
Another approach could be to try feature engineering: perhaps you could replace the country with whatever relevant metrics you can find about that country. For example, suppose the model is trying to predict how likely the user is to subscribe to your company's healthcare services: perhaps you know that users from richer countries are more likely to subscribe, and users from countries with good public healthcare are less likely to subscribe. If so, then rather than feeding your model a one-hot vector representing the country, you could instead feed it a pair of scores representing how rich the user's country is, and how good its public healthcare system is. For example, one country may be represented as [0.8, 0.2] (meaning pretty rich country, with a lousy healthcare system), and another country may be represented as [0.5, 0.7] (meaning not super rich, but a pretty good healthcare system). This representation is much more efficient than the one-hot vector representation, since it's just a 2-dimensional vector (in this example). The model doesn't need to learn that users from country one are not likely to subscribe, that users from country two are likely to subscribe, and so on for all 200 countries. Instead, it just needs to learn that richer = more likely to subscribe, and better public healthcare = less likely. That's much easier to learn. However, the downside of this approach is that you need to perform feature engineering: it requires expert knowledge, and you need to find relevant metrics that can replace your categorical inputs.
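In code, that hand-crafted representation could be as simple as a lookup table (the country names and scores below are invented for illustration; in practice you would derive them from real statistics such as GDP per capita or healthcare indices):

# Hypothetical hand-crafted features: [wealth_score, public_healthcare_score].
country_features = {
    "CountryA": [0.8, 0.2],  # pretty rich, lousy public healthcare
    "CountryB": [0.5, 0.7],  # not super rich, pretty good public healthcare
}

# Replace the categorical input with its 2D feature vector before feeding the model:
user_countries = ["CountryA", "CountryB", "CountryA"]
encoded = [country_features[c] for c in user_countries]
print(encoded)  # [[0.8, 0.2], [0.5, 0.7], [0.8, 0.2]]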
With embeddings, you also replace your categorical inputs with a small vector like [0.2, 0.8], but this time you don't have to come up with these numbers yourself. You just let the neural network learn good values for each country. If you let it train for long enough on a large enough dataset, it may end up figuring out on its own the rules about country wealth and public healthcare level. It may discover other rules as well. However, the vectors (embeddings) representing each country will not be immediately interpretable, unlike in the previous approach. But you may find that rich countries are clustered in embedding space, for example. Inspecting the embeddings may in fact teach you something about which country features really matter for your predictions.
Now let's look at the code example in the book:
>>> import numpy as np
>>> import tensorflow as tf
>>> embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=2)
>>> embedding_layer(np.array([2, 4, 2]))
<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-0.04663396,  0.01846724],
       [-0.02736737, -0.02768031],
       [-0.04663396,  0.01846724]], dtype=float32)>
In this example, we create an embedding layer which will encode a categorical feature with 5 categories (input_dim=5), and will produce trainable 2D embeddings (output_dim=2). For example, you could use this layer to encode countries, if there were just 5 possible countries. Then we use this layer to encode 3 samples. Going with the country example, we're encoding countries 2, 4, and again 2. The output is a tensor containing the 3 embeddings corresponding to these countries. Note that the first and last rows are identical, since we encoded country 2 twice.
Right now, these embeddings are just random, since the layer gets initialized randomly. However, if we use this layer in a model, and we train that model, then the embeddings should get better and better during training.
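To see this in action, here is a minimal sketch (not from the book: the model architecture and the random dummy data are made up just for demonstration) that wraps the embedding layer in a tiny model, trains it, and shows that the embedding weights have changed:

import numpy as np
import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=2)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype="int32"),
    embedding_layer,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1)
])
model.compile(loss="mse", optimizer="sgd")

# Random dummy data: 100 samples, each a single category index in [0, 5).
X_dummy = np.random.randint(0, 5, size=(100, 1)).astype("int32")
y_dummy = np.random.rand(100, 1)

before = embedding_layer.get_weights()[0].copy()
model.fit(X_dummy, y_dummy, epochs=5, verbose=0)
after = embedding_layer.get_weights()[0]
print(np.allclose(before, after))  # False: the embeddings were updated during training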
The book contains another example on the same page, using the Embedding layer in a model for the California housing task (chapter 2):
>>> ocean_prox = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
>>> str_lookup_layer = tf.keras.layers.StringLookup()
>>> str_lookup_layer.adapt(ocean_prox)
>>> lookup_and_embed = tf.keras.Sequential([
...     str_lookup_layer,
...     tf.keras.layers.Embedding(input_dim=str_lookup_layer.vocabulary_size(),
...                               output_dim=2)
... ])
>>> lookup_and_embed(np.array([["<1H OCEAN"], ["ISLAND"], ["<1H OCEAN"]]))
In this example, we start by creating a StringLookup layer, which we will use to convert the ocean_proximity feature from strings to integer indices. We then adapt this layer by giving it a data sample. Note that with the default settings, the StringLookup layer reserves index 0 for unknown categories, so the known categories get mapped to indices 1 and up; that's also why the Embedding layer's input_dim is set to str_lookup_layer.vocabulary_size() rather than hard-coded to 5. Then we create a Sequential model, starting with the StringLookup layer we just created, and continuing with an Embedding layer: this sequential model takes the ocean_proximity strings as input and outputs the corresponding embeddings. It's a preprocessing model.
On the following page, this preprocessing model is used inside a complete neural net model, along with a Dense layer. When we train that model, the Embedding layer gets trained as well, so the embedding vectors get better and better. After training, when we use the model to make predictions, the embedding vectors no longer change.
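Here is a minimal end-to-end sketch of what that looks like (this is not the exact code from the book: the numeric feature size, the training data, and the layer sizes are made up just to show compile() and fit() working):

import numpy as np
import tensorflow as tf

ocean_prox = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
str_lookup_layer = tf.keras.layers.StringLookup()
str_lookup_layer.adapt(ocean_prox)
lookup_and_embed = tf.keras.Sequential([
    str_lookup_layer,
    tf.keras.layers.Embedding(input_dim=str_lookup_layer.vocabulary_size(),
                              output_dim=2)
])

# Two inputs: some numeric features plus the categorical ocean_proximity strings.
num_input = tf.keras.layers.Input(shape=[3], name="numeric")
cat_input = tf.keras.layers.Input(shape=[], dtype=tf.string, name="ocean_proximity")
cat_embedding = lookup_and_embed(cat_input)  # shape (batch, 2)
concat = tf.keras.layers.concatenate([num_input, cat_embedding])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[num_input, cat_input], outputs=[output])
model.compile(loss="mse", optimizer="sgd")

# Made-up training data, just to show the model training end to end.
X_num = np.random.rand(6, 3)
X_cat = np.array(["<1H OCEAN", "ISLAND", "INLAND",
                  "NEAR BAY", "NEAR OCEAN", "<1H OCEAN"])
y = np.random.rand(6, 1)
model.fit((X_num, X_cat), y, epochs=2, verbose=0)
# After fit(), lookup_and_embed.layers[1].get_weights()[0] contains the trained embeddings.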
The number of dimensions for your embeddings (i.e., output_dim) is a hyperparameter you can tune. If there are many categories, it's likely that you will need more dimensions. As a rule of thumb, the number of dimensions you need is generally proportional to the log of the number of categories. But this is not a hard rule; it depends on the task.
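Just to give a feel for the scale this rule of thumb implies (the multiplier below is an arbitrary constant I picked for illustration, not a value from the book):

import math

# Illustrative only: a log-proportional heuristic with an arbitrary constant.
# In practice, treat output_dim as a hyperparameter to tune, not to compute.
for n_categories in (200, 50_000):
    print(f"{n_categories} categories -> roughly "
          f"{round(10 * math.log(n_categories))} embedding dimensions")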
I hope this clarifies things a bit?
@ageron It's a little difficult to visualize this mentally, because I'm used to training neurons to help reach the targets, not to training the data attributes themselves...
Before your explanation, I thought the Embedding NN was trained beforehand and separately from the "final" NN (so that the final NN was fed the output of the Embedding NN, along with the remaining attributes).
But the penultimate paragraph of your GitHub reply was very clarifying about the Embedding layer being trained simultaneously, inside the same NN that is fed all the data.
That OpenAI ChatGPT tool is helping me a lot to fix my miswritten TensorFlow 2.0 code. I hope they don't replace our Data Science work any time soon!
Thanks Ageron!