etna
Speed up columns slices: `etna.datasets.utils.select_columns`
🚀 Feature Request
In a lot of places we use `df.loc[:, pd.IndexSlice[segments, column]]` to select `column` from all the segments. It appears to be very slow when there are many segments.
We should find the places where we use it and make sure that it can be replaced with `df.loc[:, pd.IndexSlice[:, column]]` without problems.
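To make the comparison concrete, here is a minimal sketch that checks that both slices return the same frame for a scalar `column` and times them. The dataframe size, the level names `segment` and `feature`, and the helper names `slow_slice` and `fast_slice` are assumptions chosen for illustration; the sizes are deliberately small so the script runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sizes, kept small so the script finishes quickly.
n_segments, n_timestamps = 200, 50
segments = [f"segment_{i:04d}" for i in range(n_segments)]
features = ["exog_0", "exog_1", "target"]

# Wide dataframe: the first column level is the segment, the second is the feature.
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(n_timestamps, len(columns))),
    index=pd.date_range("2020-01-01", periods=n_timestamps, freq="D"),
    columns=columns,
)


def slow_slice(df, segments, column):
    # Passes every segment explicitly into the slice.
    return df.loc[:, pd.IndexSlice[segments, column]]


def fast_slice(df, column):
    # Uses a null slice on the segment level instead.
    return df.loc[:, pd.IndexSlice[:, column]]


slow = slow_slice(df, segments, "target")
fast = fast_slice(df, "target")
same_result = slow.equals(fast)

slow_time = timeit.timeit(lambda: slow_slice(df, segments, "target"), number=5)
fast_time = timeit.timeit(lambda: fast_slice(df, "target"), number=5)
print(f"same result: {same_result}; slow: {slow_time:.4f}s, fast: {fast_time:.4f}s")
```

The timings depend on the machine and the pandas version, so treat them as a rough indication only.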
There was a problem with the second option: #188. We should investigate whether it still exists and under which conditions:
- Does it apply when selecting only one column? (`SklearnTransform` selects many.)
- Can it be avoided by some trick when taking slices (sorting columns, for example)?
Proposal
- Find all places with the slow slice `df.loc[:, pd.IndexSlice[segments, column]]` where `column` is a scalar. Replace them with a function (you can add it to `etna.datasets.utils`). Try to replace the slow slice inside the function with the fast slice `df.loc[:, pd.IndexSlice[:, column]]`. Make sure that in this case the columns are not reordered in different pandas versions.
- Do the same with a list of values in `column` (e.g. `SklearnTransform`) and investigate the reordering issue during testing. We want to avoid it without putting all the segments into the slice.
- Benchmark the change to confirm that the changed transforms (or other calls) become faster. Add the benchmarking code and its results to the comments of the PR. E.g. you can take a dataframe with 50000 segments, 100 timestamps, 5 additional int columns, 5 additional float columns, 5 additional category columns.
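One possible shape for such a function is sketched below. This is not the actual `etna` implementation: it swaps the `pd.IndexSlice` lookup for a boolean mask on the feature level and then reorders the result explicitly, so the output order does not rely on how a given pandas version orders slice results. It assumes the segment is column level 0, the feature is column level 1, and the requested features are unique. Whether it is faster than the slow slice would still need benchmarking.

```python
import numpy as np
import pandas as pd


def select_columns(df: pd.DataFrame, features) -> pd.DataFrame:
    """Hypothetical sketch: select the given features from every segment.

    Segments keep the order in which they first appear in ``df``;
    within each segment, features follow the requested order.
    """
    if isinstance(features, str):
        features = [features]
    features = list(features)

    segment_level = df.columns.get_level_values(0)
    feature_level = df.columns.get_level_values(1)

    # Positions of the matching columns, in the order they appear in df.
    positions = np.flatnonzero(feature_level.isin(features))

    # Rank each selected column: segment by first appearance, feature by requested order.
    segment_rank = segment_level.unique().get_indexer(segment_level[positions])
    feature_rank = pd.Index(features).get_indexer(feature_level[positions])

    # np.lexsort treats the last key as primary: segment first, then feature.
    order = np.lexsort((feature_rank, segment_rank))
    return df.iloc[:, positions[order]]


# Usage on a small wide dataframe with unsorted segments and features.
example_columns = pd.MultiIndex.from_product(
    [["segment_2", "segment_1", "segment_0"], ["target", "exog_2", "exog_1", "exog_0"]],
    names=["segment", "feature"],
)
example_df = pd.DataFrame(np.arange(36).reshape(3, 12), columns=example_columns)
selected = select_columns(example_df, ["exog_1", "exog_2"])
print(list(selected.columns))
```

The explicit ranking step is the design choice here: the order is computed from the request itself, so no list of segments has to be passed into a slice.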
Test cases
- Make sure that the current tests pass for the scalar case.
- Make sure that the current tests pass for the list case.
- Add tests for the function when selecting one column.
- Add tests for the function when selecting multiple columns (in `SklearnTransform` we had some tests on reordering, they can be useful).
Additional context
No response
And here
I'll try to explain the core of the issue.
We have a wide dataframe `df`. We want to select a few columns, `[column_1, column_2]`:
res = df.loc[:, pd.IndexSlice[:, [column_1, column_2]]]
In pandas 1.1.*: we get a dataframe where, on the last column level, `column_1` and `column_2` are ordered by their position in `df`. So, if `column_2` comes before `column_1` in `df`, we get them in an unexpected order: first the values from `column_2`, then the values from `column_1`. The column names are ordered the same way as the values.
In pandas >= 1.2: we get the columns in the order that we passed to `loc`.
If we make selection like:
res = df.loc[:, pd.IndexSlice[segments, [column_1, column_2]]]
then in both cases we get the columns in the order passed to `loc`.
More detailed results. Imagine we have a `df_wide` with segments `["segment_2", "segment_1", "segment_0"]` and with features `["target", "exog_2", "exog_1", "exog_0"]`.
Calling `df_wide.loc[:, pd.IndexSlice[:, ["exog_1", "exog_2"]]]` gives this order of columns:
- `pandas=1.1.5`:
  - `segment_2/exog_2`
  - `segment_2/exog_1`
  - `segment_0/exog_2`
  - `segment_0/exog_1`
  - `segment_1/exog_2`
  - `segment_1/exog_1`
- `pandas=1.3.5`:
  - `segment_2/exog_1`
  - `segment_0/exog_1`
  - `segment_1/exog_1`
  - `segment_2/exog_2`
  - `segment_0/exog_2`
  - `segment_1/exog_2`
Calling `df_wide.loc[:, pd.IndexSlice[["segment_2", "segment_0", "segment_1"], ["exog_1", "exog_2"]]]` gives this order of columns:
- `pandas=1.1.5`:
  - `segment_2/exog_1`
  - `segment_2/exog_2`
  - `segment_0/exog_1`
  - `segment_0/exog_2`
  - `segment_1/exog_1`
  - `segment_1/exog_2`
- `pandas=1.3.5`:
  - `segment_2/exog_1`
  - `segment_2/exog_2`
  - `segment_0/exog_1`
  - `segment_0/exog_2`
  - `segment_1/exog_1`
  - `segment_1/exog_2`
- If we don't give segments we have different results on different pandas versions.
- If we give segments, the results are the same.
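To check what an installed pandas version does, a small script along these lines can be used. It is only a sketch: the dataframe contents, the level names `segment` and `feature`, and the helper `format_columns` are assumptions for illustration. It deliberately contains no expected output, because the order for the first call is exactly what varies between versions.

```python
import numpy as np
import pandas as pd

segments = ["segment_2", "segment_1", "segment_0"]
features = ["target", "exog_2", "exog_1", "exog_0"]

# Wide dataframe with deliberately unsorted segments and features.
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df_wide = pd.DataFrame(
    np.zeros((3, len(columns))),
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
    columns=columns,
)

requested = ["exog_1", "exog_2"]

# Selection without segments: the column order may differ between pandas versions.
without_segments = df_wide.loc[:, pd.IndexSlice[:, requested]]

# Selection with explicit segments.
with_segments = df_wide.loc[:, pd.IndexSlice[segments, requested]]


def format_columns(df):
    # Render each column as "segment/feature", in the order pandas returned it.
    return [f"{segment}/{feature}" for segment, feature in df.columns]


print("pandas", pd.__version__)
print("without segments:", format_columns(without_segments))
print("with segments:   ", format_columns(with_segments))
```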