LightGBM
Support select a subset of features from a constructed dataset
Summary
Support select a subset of features from a constructed dataset.
lgb.Dataset.subset(used_indices: List[int], used_columns: List[str]=None, params=None) -> lgb.Dataset
Motivation
The Dataset object of LightGBM is awesome because it's super memory-efficient. So I build a workflow around the Dataset like this:
- Construct Dataset from a huge dataframe with many features, via the lgb.Sequence API.
- Train a LightGBM model on the Dataset, and select the top features from the huge dataframe.
- Construct another Dataset on the selected dataframe.
- Train a LightGBM model on the selected Dataset.
However, steps 2 and 3 require many I/O operations and have become the bottleneck in my experiments, so I wonder if we can select a subset of features directly from a constructed dataset. With that API, the workflow above would be:
- Construct Dataset from a huge dataframe via the lgb.Sequence API.
- Train a LightGBM model on the Dataset, then select the top features from the Dataset directly and construct a new smaller Dataset.
- Train a LightGBM model on the selected Dataset.
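The feature-selection step in the workflow above can be sketched with plain NumPy, assuming the per-feature importances have already been obtained from a trained booster (the `importances` array below is a made-up stand-in for `booster.feature_importance()`; the LightGBM calls themselves are elided):

```python
import numpy as np

# Stand-in for booster.feature_importance() on a 3-feature dataset
importances = np.array([10, 3, 42])
feature_names = [f'feat_{i}' for i in range(3)]

k = 2
# Indices of the top-k features by importance
top_idx = np.argsort(importances)[::-1][:k]
# Keep original column order for the new dataset
top_features = [feature_names[i] for i in sorted(top_idx)]
print(top_features)  # ['feat_0', 'feat_2']
```

The resulting name list is what the proposed `used_columns` argument would consume.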
Description
I think this feature-selection demand is common when handling a huge number of features after a dataset is constructed. In my case, because the dataframe is too large to fit in RAM, it takes a long time to subset the raw dataframe (~30 min) and reconstruct another dataset from it (~30 min), while the final training only takes about 20 min.
I've read through the existing parameters and APIs, and I think the three closest features are:
- Dataset parameter `ignore_column`. But it works only when loading data directly from a text file; I've read the source code, and it skips the columns before constructing them.
- Python API `lgb.Dataset.add_features_from()`. But it can only add features, not remove them.
- Python API `lgb.Dataset.subset()`. But it only selects rows, not columns.
I think it's reasonable to implement this feature as an optional argument for the existing .subset()
API. So it looks like:
lgb.Dataset.subset(used_indices: List[int], used_columns: List[str]=None, params=None) -> lgb.Dataset
And a toy example will be:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=[f'feat_{i}' for i in range(3)])
dataset = lgb.Dataset(df).construct()
dataset_selected = dataset.subset(used_columns=['feat_1', 'feat_2'])
Thoughts on implementation
On one hand, I understand this request is non-trivial, because there is special handling for sparse features and categorical features. In my case, all features are dense and numeric, so selecting a subset of them makes sense.
I think to support this API, we need at least to check that:
is_enable_sparse == False
enable_bundle == False
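A hedged sketch of how such a guard might look in the Python wrapper (the helper name `_check_column_subset_supported` is hypothetical; `is_enable_sparse` and `enable_bundle` are real LightGBM parameters, and both default to true, which is assumed by the `.get(..., True)` fallbacks below):

```python
def _check_column_subset_supported(params: dict) -> None:
    # Column subsetting is only straightforward when features stay dense
    # and unbundled (no EFB), so reject any other configuration.
    if params.get('is_enable_sparse', True):
        raise ValueError('column subset requires is_enable_sparse=False')
    if params.get('enable_bundle', True):
        raise ValueError('column subset requires enable_bundle=False')

# Passes: both optimizations explicitly disabled
_check_column_subset_supported({'is_enable_sparse': False, 'enable_bundle': False})
```

With default parameters (both optimizations on), the helper would raise instead of silently producing a mis-binned subset.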
On the other hand, I think the request is not impossible, because AddFeaturesFrom in dataset.cpp already implements much of the needed functionality, such as resizing feature_names and the feature map.
Hi, I found this request useful as well.
My main goal is to evaluate using a group field, similar to the issue posted here: #4995.
My problem is that I want to ignore the additional "id" field mentioned in the attached post, as it is not a feature I can train on. As noted in that post and in your Python package documentation, "ignore_column works only in case of loading data directly from text file", which is not possible in my use case of loading Parquet files. I would appreciate your advice or a workaround for this issue.
Can we not do `lgb.Dataset.data.drop(columns=[list of columns to remove])`?
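One caveat with that idea, as I understand it: `Dataset.data` only holds the cached raw Python object, and once `.construct()` has run the binned representation lives in native memory, so dropping columns there would not shrink the constructed dataset. A minimal sketch of the workaround, which drops the unwanted columns from the raw dataframe *before* construction (pandas only; the `lgb.Dataset` call is left commented as the intended continuation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                  columns=[f'feat_{i}' for i in range(3)])

# Drop the non-feature column(s) before building the Dataset, so the
# construction step never bins them in the first place.
selected = df.drop(columns=['feat_1'])
# dataset = lgb.Dataset(selected).construct()

print(list(selected.columns))  # ['feat_0', 'feat_2']
```

This avoids the broken-binning question entirely, at the cost of still paying the reconstruction time the original request is trying to eliminate.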