
Support selecting a subset of features from a constructed Dataset

Open EdmondElephant opened this issue 2 years ago • 2 comments

Summary

Support selecting a subset of features from a constructed Dataset.

lgb.Dataset.subset(used_indices: List[int], used_columns: List[str]=None, params=None) -> lgb.Dataset

Motivation

The Dataset object of LightGBM is awesome because it's super memory-efficient, so I built a workflow around the Dataset like this:

  1. Construct Dataset from a huge dataframe with many features, via the lgb.Sequence API.
  2. Train a LightGBM model on the Dataset, and select the top features from the huge dataframe.
  3. Construct another Dataset on the selected dataframe.
  4. Train a LightGBM model on the selected Dataset.

However, steps 2 and 3 require many I/O operations and have become the bottleneck in my experiments, so I wonder whether we can select a subset of features directly from a constructed Dataset. With that API, the workflow above would become:

  1. Construct Dataset from a huge dataframe via the lgb.Sequence API.
  2. Train a LightGBM model on the Dataset, then select the top features from the Dataset directly and construct a new smaller Dataset.
  3. Train a LightGBM model on the selected Dataset.
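Today, the selection between steps 2 and 3 of the original workflow has to go through the raw dataframe. A minimal sketch of that step, with hypothetical importance values standing in for what booster.feature_importance() would return:

```python
import numpy as np
import pandas as pd

# hypothetical importances; in practice these come from booster.feature_importance()
importances = np.array([10, 3, 25, 7])
feature_names = [f'feat_{i}' for i in range(4)]

# pick the names of the top-2 features by importance
top_k = 2
top = [feature_names[i] for i in np.argsort(importances)[::-1][:top_k]]

# slice the raw dataframe and rebuild a Dataset from it -- the expensive step
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=feature_names)
df_selected = df[top]
```

The dataframe slice itself is cheap; the cost in the motivating case comes from the dataframe not fitting in RAM and from reconstructing the Dataset from `df_selected`.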

Description

I think this feature-selection need is common when handling a huge number of features after a Dataset has been constructed. In my case, because the dataframe is too large to fit in RAM, it takes a long time to subset the raw dataframe (~30 min) and reconstruct another Dataset from it (~30 min), while the final training takes only about 20 min.

I've read through the existing parameters and APIs, and I think the three closest existing features are:

  1. The Dataset parameter ignore_column. But it works only when loading data directly from a text file; reading the source code, it skips the columns before the Dataset is constructed.
  2. The Python API lgb.Dataset.add_features_from(). But it can only add features, not remove them.
  3. The Python API lgb.Dataset.subset(). But it only works for selecting rows, not columns.

I think it's reasonable to implement this feature as an optional argument to the existing .subset() API, so the signature would look like:

lgb.Dataset.subset(used_indices: List[int], used_columns: List[str]=None, params=None) -> lgb.Dataset

A toy example (using the proposed used_columns argument, which does not exist yet) would be:

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=[f'feat_{i}' for i in range(3)])
dataset = lgb.Dataset(df).construct()
dataset_selected = dataset.subset(used_indices=[0, 1], used_columns=['feat_0', 'feat_2'])

Thoughts on implementation

On one hand, I understand this request is non-trivial, because there is special handling for sparse features and categorical features. In my case, all features are dense and numeric, so selecting a subset of them makes sense.

I think that to support this API, we at least need to check that:

is_enable_sparse == False
enable_bundle == False
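Both of these are real Dataset parameters today; a Dataset constructed with them disabled avoids the sparse and Exclusive Feature Bundling (EFB) representations that make column removal hard after construction. A sketch of how they would be set:

```python
# disabling sparse handling and EFB keeps one stored feature per raw column,
# which is the easy case for column subsetting
params = {
    'is_enable_sparse': False,
    'enable_bundle': False,
}
# dataset = lgb.Dataset(df, params=params)  # df as in the toy example above
```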

On the other hand, I don't think the request is impossible, because AddFeaturesFrom in dataset.cpp already implements much of the needed functionality, such as resizing feature_names and the feature map.

EdmondElephant avatar Aug 22 '22 14:08 EdmondElephant

Hi, I found this request useful as well.
My main goal is to evaluate using a group field, similar to the issue posted in #4995.

My problem is that I want to ignore the additional "id" field from the linked post, since it is not a feature I can train on. As mentioned in this post and in the Python package documentation, "ignore_column works only in case of loading data directly from text file", which does not apply to my case of loading Parquet files. I would appreciate your advice or a workaround for this issue.

advahadr avatar Mar 08 '23 15:03 advahadr
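One workaround for the comment above, since ignore_column cannot help outside of text files: drop the "id" column from the dataframe before constructing the Dataset, keeping it aside for the grouped evaluation. A pandas-only sketch (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'feat_0': [0.1, 0.2, 0.3],
    'feat_1': [1.0, 2.0, 3.0],
    'target': [0, 1, 0],
})

group_ids = df['id'].to_numpy()               # kept aside for grouped evaluation
features = df.drop(columns=['id', 'target'])  # never enters the Dataset
# train_set = lgb.Dataset(features, label=df['target'])  # requires lightgbm
```

This avoids the problem entirely for a single extra column, but does not help with the original request of subsetting many features from an already-constructed Dataset.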

Can we not do lgb.Dataset.data.drop(columns=[list of columns to remove])?

sumeetkr13 avatar Feb 22 '24 20:02 sumeetkr13
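A note on the last question: DataFrame.drop returns a new frame, and a constructed Dataset stores its own binary representation, so mutating Dataset.data would not shrink the constructed Dataset anyway; a new Dataset would still have to be built from the reduced frame, which is exactly the expensive step this issue wants to avoid. A small pandas sketch of the first point:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
reduced = df.drop(columns=['b'])  # returns a NEW frame; df keeps column 'b'

# A constructed lgb.Dataset holds its own binary copy of the data, so even
# `dataset.data = reduced` would not remove the column from the constructed
# Dataset; lgb.Dataset(reduced) would have to be built and constructed again.
```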