etna
Create notebook about feature selection
🚀 Feature Request
Create a notebook demonstrating our methods for feature selection
Motivation
Show our feature selection methods to the users
Proposal
- Create a notebook with a short description and demonstration of our feature selection transforms (TreeFeatureSelectionTransform, GaleShapleyFeatureSelectionTransform, MRMRFeatureSelectionTransform)
- Also include the plot_feature_relevance method there
Test cases
No response
Alternatives
No response
Additional context
No response
Checklist
- [ ] I discussed this issue with ETNA Team
sudo py-spy record -o speedscope.json -f speedscope --rate 50 --nonblocking -- python f.py
# %% [markdown]
# # Feature selection
#
# This notebook contains simple examples of using feature selection transforms with the ETNA library.
#
# ### Navigation
#
# - [Intro](#20-intro-to-feature-selection)
# - [TreeFeatureSelectionTransform](#21-tree)
# - [GaleShapleyFeatureSelectionTransform](#21-galeshapleyfeatureselectiontransform)
# - [MRMRFeatureSelectionTransform](#22-mrmrfeatureselectiontransform)
#
# %%
import warnings
warnings.filterwarnings("ignore")
# %% [markdown]
# ## 1. Load Dataset
#
# We are going to work with the time series from Tabular Playground Series - Jan 2022. The dataset contains daily merchandise sales (mugs, hats, and stickers) at two imaginary store chains across three Scandinavian countries. As exogenous data, we will use the Finland, Norway, and Sweden Weather Data 2015-2019 dataset, which contains daily country-average precipitation, snow depth, and air temperature.
# %%
import pandas as pd
df = pd.read_csv("examples/data/nordic_merch_sales.csv")
# %%
from etna.datasets import TSDataset

# Convert the long-format DataFrame to ETNA's wide format
df = TSDataset.to_dataset(df)
ts = TSDataset(df, freq="D")
# Plot four of the segments
ts.plot(n_segments=4)
# %%
HORIZON = 60
# %% [markdown]
# ## 2. Feature selection methods
#
# ### 2.0 Intro to feature selection
#
# Let's create features and build a pipeline for the dataset:
# %%
from etna.pipeline import Pipeline
from etna.models import CatBoostModelPerSegment
from etna.transforms import (
    DateFlagsTransform,
    MeanTransform,
    LagTransform,
    TrendTransform,
    FourierTransform,
    HolidayTransform,
)
from etna.metrics import SMAPE

transforms = [
    TrendTransform(in_column="target", out_column="trend"),
    LagTransform(in_column="target", lags=range(HORIZON, 100), out_column="target_lag"),
    DateFlagsTransform(
        day_number_in_month=True, day_number_in_week=False, is_weekend=False, out_column="datetime_flag"
    ),
    MeanTransform(in_column=f"target_lag_{HORIZON}", window=12, seasonality=7, out_column="mean_transform"),
    FourierTransform(period=250, order=6, out_column="fourier"),
    HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
    HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
    HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
]
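# %% [markdown]
# The lags above start at `HORIZON`. A lag shorter than the forecast horizon would need target values that are not yet known for the later forecast steps. The pure-Python sketch below illustrates this on a toy series; `lag_feature` and the toy data are hypothetical helpers for illustration only and are not part of ETNA.

```python
from typing import List, Optional


def lag_feature(values: List[float], lag: int, n_future: int) -> List[Optional[float]]:
    """Shift `values` by `lag` steps and extend the result `n_future` steps past the end.

    Positions whose source value is unknown are filled with None.
    """
    extended_len = len(values) + n_future
    return [
        values[i - lag] if 0 <= i - lag < len(values) else None
        for i in range(extended_len)
    ]


history = [10.0, 11.0, 12.0, 13.0, 14.0]  # toy target history
horizon = 3

# Lag equal to the horizon: every future step has a known value.
future_ok = lag_feature(history, lag=horizon, n_future=horizon)[len(history):]

# Lag shorter than the horizon: later future steps have no value.
future_short = lag_feature(history, lag=1, n_future=horizon)[len(history):]

print(future_ok)
print(future_short)
```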
# %% [markdown]
# With this simple transform, we improved both SMAPE and backtest time by more than a factor of two.
#
# ETNA also provides methods to plot the importance of each feature:
# %%
from etna.transforms import GaleShapleyFeatureSelectionTransform
from etna.analysis.feature_relevance import StatisticsRelevanceTable

rt = StatisticsRelevanceTable()
feature_selector_transform = GaleShapleyFeatureSelectionTransform(top_k=20, relevance_table=rt, return_features=True)
pipeline = Pipeline(
    model=CatBoostModelPerSegment(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
metrics_galeshapley_feature_selector, forecast_galeshapley_feature_selector, _ = pipeline.backtest(
    ts=ts, metrics=[SMAPE()], n_folds=1
)
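# %% [markdown]
# The transform's name refers to the Gale-Shapley stable matching algorithm. The sketch below shows the textbook deferred-acceptance procedure on a toy example where segments propose to features. It is not ETNA's implementation: how the library builds preference lists from the relevance table is not shown here, and the function name, segment names, and preference lists are all assumptions made for illustration.

```python
from typing import Dict, List


def gale_shapley(
    proposer_prefs: Dict[str, List[str]], acceptor_prefs: Dict[str, List[str]]
) -> Dict[str, str]:
    """Return a stable matching proposer -> acceptor via deferred acceptance."""
    # rank[a][p] is the position of proposer p in acceptor a's preference list
    rank = {a: {p: i for i, p in enumerate(prefs)} for a, prefs in acceptor_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of the next acceptor to try
    engaged_to: Dict[str, str] = {}  # acceptor -> currently held proposer
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has exhausted its list and stays unmatched
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(a)
        if current is None:
            engaged_to[a] = p  # a was free and accepts
        elif rank[a][p] < rank[a][current]:
            engaged_to[a] = p  # a prefers the new proposer
            free.append(current)
        else:
            free.append(p)  # a rejects p, who will try the next choice

    return {p: a for a, p in engaged_to.items()}


# Hypothetical preferences: both segments rank "lag_60" first,
# but "lag_60" prefers segment_b, so segment_a falls back to "trend".
segment_prefs = {"segment_a": ["lag_60", "trend"], "segment_b": ["lag_60", "trend"]}
feature_prefs = {"lag_60": ["segment_b", "segment_a"], "trend": ["segment_a", "segment_b"]}

matching = gale_shapley(segment_prefs, feature_prefs)
print(matching)
```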
Maybe we can pass all columns to the Mann-Whitney test. The current implementation compares features separately.
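
For reference, here is a minimal pure-Python sketch of the Mann-Whitney U statistic for a single pair of samples, which is the kind of one-at-a-time comparison mentioned above. The function name and the sample values are hypothetical, and this is not the code used in ETNA.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic of x against y: number of pairs with x_i > y_j, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical values of one feature in two groups
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two statistics always sum to `len(group_a) * len(group_b)`, which is a quick sanity check when comparing many features in a loop.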
Waiting for #886.
Closed by #875.