LightGBM icon indicating copy to clipboard operation
LightGBM copied to clipboard

Share LightGBM categorical variable representation across multiple columns

Open carlosg-m opened this issue 5 months ago • 2 comments

Summary

Share LightGBM categorical variable representation across multiple columns.

Motivation

Very useful for sequences and several use cases.

Description

Encode the same category level along multiple columns.

Very similar to using an Embedding Layer in a Neural Network to encode a categorical variable and applying the layer across a sequence of that same variable (example of a word embedding).

The same can be achieved with Target Encoding or other representation applied before modeling by stacking the category columns, into a key-value structure, and then applying said encoding.

Example train dataset

Var 1 Var 2 Var 3 Target
Goose Cat Dog 103
Cat Cat Dog 4
Goose Goose Goose 300

Stack shared category levels

Category Level Target
Goose 103
Cat 103
Dog 103
Cat 4
Cat 4
Dog 4
Goose 300
Goose 300
Goose 300

Naive target encoding

Category Level Encode
Goose 250
Dog 53
Cat 37

carlosg-m avatar Aug 30 '24 11:08 carlosg-m