LightGBM
LightGBM copied to clipboard
Share LightGBM categorical variable representation across multiple columns
Summary
Share LightGBM categorical variable representation across multiple columns.
Motivation
Very useful for sequences and several use cases.
Description
Encode the same category level along multiple columns.
Very similar to using an Embedding Layer in a Neural Network to encode a categorical variable and applying the layer across a sequence of that same variable (example of a word embedding).
The same can be achieved with Target Encoding or other representation applied before modeling by stacking the category columns, into a key-value structure, and then applying said encoding.
Example train dataset
Var 1 | Var 2 | Var 3 | Target |
---|---|---|---|
Goose | Cat | Dog | 103 |
Cat | Cat | Dog | 4 |
Goose | Goose | Goose | 300 |
Stack shared category levels
Category Level | Target |
---|---|
Goose | 103 |
Cat | 103 |
Dog | 103 |
Cat | 4 |
Cat | 4 |
Dog | 4 |
Goose | 300 |
Goose | 300 |
Goose | 300 |
Naive target encoding
Category Level | Encode |
---|---|
Goose | 250 |
Dog | 53 |
Cat | 37 |