etna Create notebook about feature selection

🚀 Feature Request

Create a notebook demonstrating our methods for feature selection

Motivation

Show our feature selection methods to the users

Proposal

Create a notebook with a short description and demonstration of our feature selection transforms (TreeFeatureSelectionTransform, GaleShapleyFeatureSelectionTransform, MRMRFeatureSelectionTransform)
Also include the plot_feature_relevance method

Test cases

No response

Alternatives

No response

Additional context

No response

Checklist

[ ] I discussed this issue with ETNA Team

May 30 '22 13:05 alex-hse-repository

sudo py-spy record -o speedscope.json -f speedscope --rate 50 --nonblocking -- python f.py


# %% [markdown]
# # Feature selection
# 
# This notebook contains simple examples of using feature selection transforms with the ETNA library.
# 
# ### Navigation
# 
# - [Intro](#20-intro-to-feature-selection)
# - [TreeFeatureSelectionTransform](#21-tree)
# - [GaleShapleyFeatureSelectionTransform](#21-galeshapleyfeatureselectiontransform)
# - [MRMRFeatureSelectionTransform](#22-mrmrfeatureselectiontransform)
# 

# %%
import warnings

warnings.filterwarnings("ignore")

# %% [markdown]
# ## 1. Load Dataset
# 
# We are going to work with the time series from Tabular Playground Series - Jan 2022. The dataset contains daily sales of merchandise (mugs, hats, and stickers) at two imaginary store chains across three Scandinavian countries. As exogenous data, we will use the Finland, Norway, and Sweden Weather Data 2015-2019 dataset, which contains daily country-average precipitation, snow depth, and air temperature.

# %%
import pandas as pd

df = pd.read_csv("examples/data/nordic_merch_sales.csv")

# %%
from etna.datasets import TSDataset

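# Convert the long-format DataFrame to ETNA's wide format
df = TSDataset.to_dataset(df)
ts = TSDataset(df, freq="D")
# Plot a sample of four segments
ts.plot(n_segments=4)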

# %%
HORIZON = 60

# %% [markdown]
# ## 2. Feature selection methods
# 
# ### 2.0 Intro to feature selection
# 
# Let's create features and build a pipeline for the dataset:

# %%
from etna.pipeline import Pipeline
from etna.models import CatBoostModelPerSegment
from etna.transforms import (
    DateFlagsTransform,
    MeanTransform,
    LagTransform,
    TrendTransform,
    FourierTransform,
    HolidayTransform,
)
from etna.metrics import SMAPE

transforms = [
    TrendTransform(in_column="target", out_column="trend"),
    LagTransform(in_column="target", lags=range(HORIZON, 100), out_column="target_lag"),
    DateFlagsTransform(
        day_number_in_month=True, day_number_in_week=False, is_weekend=False, out_column="datetime_flag"
    ),
    MeanTransform(in_column=f"target_lag_{HORIZON}", window=12, seasonality=7, out_column="mean_transform"),
    FourierTransform(period=250, order=6, out_column="fourier"),
    HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
    HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
    HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
]
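
# %% [markdown]
# The lags in the transform list above start at `HORIZON`. A minimal, library-independent sketch of why this matters: a lag shorter than the horizon has no known source value for part of the forecast window. The helper `lag_feature`, the toy series, and the small `HORIZON_DEMO` value are hypothetical and only illustrate the idea; this is not how ETNA's `LagTransform` is implemented.

```python
from typing import List, Optional

HORIZON_DEMO = 3  # hypothetical small horizon, for illustration only


def lag_feature(values: List[float], lag: int, n_future: int) -> List[Optional[float]]:
    """Shift the series by `lag` steps and extend it `n_future` steps ahead.

    Positions whose source value is not known yet are filled with None.
    """
    total = len(values) + n_future
    return [
        values[t - lag] if 0 <= t - lag < len(values) else None
        for t in range(total)
    ]


history = [10.0, 11.0, 12.0, 13.0, 14.0]

# Lag equal to the horizon: every future step has a known value.
future_safe = lag_feature(history, lag=HORIZON_DEMO, n_future=HORIZON_DEMO)[-HORIZON_DEMO:]

# Lag shorter than the horizon: only the first future step is covered.
future_short = lag_feature(history, lag=1, n_future=HORIZON_DEMO)[-HORIZON_DEMO:]

print(future_safe)
print(future_short)
```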

# %% [markdown]
# With this simple feature selection transform, both SMAPE and backtest time improved by more than a factor of two.
# 
# ETNA also provides methods to plot the importance of each feature:

# %%
from etna.transforms import GaleShapleyFeatureSelectionTransform
from etna.analysis.feature_relevance import StatisticsRelevanceTable

rt = StatisticsRelevanceTable()
feature_selector_transform = GaleShapleyFeatureSelectionTransform(top_k=20, relevance_table=rt, return_features=True)


pipeline = Pipeline(
    model=CatBoostModelPerSegment(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
metrics_galeshapley_feature_selector, forecast_galeshapley_feature_selector, _ = pipeline.backtest(
    ts=ts, metrics=[SMAPE()], n_folds=1
)
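
# %% [markdown]
# The transform above is named after the Gale-Shapley stable matching algorithm. As a rough illustration of the idea, the sketch below matches segments to features using a relevance table, collects the matched features, and repeats on the remaining features until `top_k` are selected. The relevance scores, segment names, and the helpers `gale_shapley_match` and `select_top_k` are hypothetical; this is a simplified sketch under those assumptions, not ETNA's implementation.

```python
from typing import Dict, List

Relevance = Dict[str, Dict[str, float]]  # relevance[segment][feature] -> score


def gale_shapley_match(relevance: Relevance, features: List[str]) -> Dict[str, str]:
    """Match each segment to at most one feature, with segments proposing."""
    # Each segment ranks the candidate features by decreasing relevance.
    preferences = {
        seg: sorted(features, key=lambda f: -scores[f])
        for seg, scores in relevance.items()
    }
    next_choice = {seg: 0 for seg in relevance}
    holder: Dict[str, str] = {}  # feature -> segment currently matched to it
    free = list(relevance)
    while free:
        seg = free.pop()
        if next_choice[seg] >= len(features):
            continue  # segment exhausted its list and stays unmatched
        feat = preferences[seg][next_choice[seg]]
        next_choice[seg] += 1
        rival = holder.get(feat)
        if rival is None:
            holder[feat] = seg
        elif relevance[seg][feat] > relevance[rival][feat]:
            # The feature prefers the segment for which it is more relevant.
            holder[feat] = seg
            free.append(rival)
        else:
            free.append(seg)
    return {seg: feat for feat, seg in holder.items()}


def select_top_k(relevance: Relevance, top_k: int) -> List[str]:
    """Repeat the matching on the remaining features until top_k are chosen."""
    remaining = sorted({f for scores in relevance.values() for f in scores})
    selected: List[str] = []
    while remaining and len(selected) < top_k:
        matching = gale_shapley_match(relevance, remaining)
        # Within a round, take the most relevant matched features first.
        matched = sorted(
            set(matching.values()),
            key=lambda f: -max(scores[f] for scores in relevance.values()),
        )
        for feat in matched:
            if len(selected) < top_k:
                selected.append(feat)
            remaining.remove(feat)
    return selected


# Hypothetical relevance table: two segments, three candidate features.
demo_relevance = {
    "segment_a": {"lag_60": 0.9, "trend": 0.5, "fourier_1": 0.1},
    "segment_b": {"lag_60": 0.8, "trend": 0.7, "fourier_1": 0.2},
}
demo_match = gale_shapley_match(demo_relevance, ["lag_60", "trend", "fourier_1"])
demo_top_2 = select_top_k(demo_relevance, top_k=2)
print(demo_match)
print(demo_top_2)
```

# %% [markdown]
# In this toy table both segments rank `lag_60` first, but the feature is more relevant for `segment_a`, so `segment_b` falls back to `trend`. One matching round therefore yields two distinct features rather than the same top feature twice.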