
Speed up columns slices: `etna.datasets.utils.select_columns`

Open Mr-Geekman opened this issue 2 years ago • 2 comments

🚀 Feature Request

In a lot of places we use df.loc[:, pd.IndexSlice[segments, column]] to select a column from all the segments. It appears to be very slow when there are many segments.

We should find places where we use it and make sure that it can be replaced with df.loc[:, pd.IndexSlice[:, column]] without problems.

There was a problem with the second option: #188. We should investigate whether it still exists and under which conditions:

  1. Does it apply when only one column is selected? (SklearnTransform selects many.)
  2. Can it be avoided by some trick when taking slices (sorting columns, for example)?

Proposal

  1. Find all places with the slow slice df.loc[:, pd.IndexSlice[segments, column]] where column is a scalar. Replace them with a function (you can add it to etna.datasets.utils). Try to replace the slow slice inside the function with the fast slice: df.loc[:, pd.IndexSlice[:, column]]. Make sure that in this case the columns are not reordered in different pandas versions.
  2. Do the same for a list of values in column (e.g. SklearnTransform) and investigate the reordering issue during testing. We want to avoid it without putting all the segments into the slice.
  3. Benchmark to confirm that the changed transforms (or other calls) become faster. Add the benchmarking code and its results in the comments of the PR. E.g. you can take a dataframe with 50000 segments, 100 timestamps, 5 additional int columns, 5 additional float columns and 5 additional category columns.
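
A minimal sketch of how the benchmark from item 3 could be set up. The helper names `make_wide_df` and `best_time` are hypothetical, the level names "segment" and "feature" are an assumption about the wide format, and the default size is deliberately small so the script finishes quickly; the 50000-segment configuration above would be passed as arguments.

```python
import time

import numpy as np
import pandas as pd


def make_wide_df(n_segments=200, n_timestamps=100, n_features=5, seed=0):
    """Build a synthetic wide frame with (segment, feature) MultiIndex columns."""
    rng = np.random.default_rng(seed)
    segments = [f"segment_{i}" for i in range(n_segments)]
    features = ["target"] + [f"exog_{i}" for i in range(n_features)]
    columns = pd.MultiIndex.from_product(
        [segments, features], names=["segment", "feature"]
    )
    index = pd.date_range("2020-01-01", periods=n_timestamps, freq="D", name="timestamp")
    values = rng.normal(size=(n_timestamps, len(columns)))
    return pd.DataFrame(values, index=index, columns=columns)


def best_time(func, repeats=3):
    """Return the best wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


df = make_wide_df()
segments = df.columns.get_level_values("segment").unique().tolist()


def slow():
    # Slice that lists every segment explicitly.
    return df.loc[:, pd.IndexSlice[segments, "target"]]


def fast():
    # Slice that takes all segments with slice(None).
    return df.loc[:, pd.IndexSlice[:, "target"]]


# Both slices should select the same set of columns before timings are compared.
assert set(slow().columns) == set(fast().columns)

print(f"explicit segments: {best_time(slow):.4f} s")
print(f"slice(None):       {best_time(fast):.4f} s")
```

The timings depend on the machine and the pandas version, so no expected numbers are given here.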

Test cases

  1. Make sure that the current tests pass for the scalar case.
  2. Make sure that the current tests pass for the list case.
  3. Add tests for the function that selects one column.
  4. Add tests for the function that selects multiple columns (in SklearnTransform we had some tests on reordering; they can be useful).
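
As an illustration of test cases 3 and 4, here is a sketch of one possible helper together with order checks. `select_feature_columns` is a hypothetical name, not the function from etna, and it uses a boolean mask over the "feature" level (an assumed level name) instead of pd.IndexSlice. The mask is a swapped-in technique: it always keeps the column order of df, so the result does not depend on how a pandas version orders list-based slices, but it also ignores the order in which the features were requested.

```python
import numpy as np
import pandas as pd


def select_feature_columns(df, features):
    """Hypothetical helper: select features for all segments, keeping df's column order."""
    if isinstance(features, str):
        features = [features]
    feature_level = df.columns.get_level_values("feature")
    missing = set(features) - set(feature_level)
    if missing:
        raise KeyError(f"Features not found: {sorted(missing)}")
    # Boolean mask over columns: True where the feature is requested.
    mask = feature_level.isin(features)
    return df.loc[:, mask]


segments = ["segment_2", "segment_1", "segment_0"]
features = ["target", "exog_2", "exog_1", "exog_0"]
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df_wide = pd.DataFrame(np.arange(3 * len(columns)).reshape(3, len(columns)), columns=columns)

# Scalar case: one column per segment, in the segment order of df_wide.
one = select_feature_columns(df_wide, "target")
assert list(one.columns) == [
    ("segment_2", "target"),
    ("segment_1", "target"),
    ("segment_0", "target"),
]

# List case: the order follows df_wide (exog_2 before exog_1), not the request.
many = select_feature_columns(df_wide, ["exog_1", "exog_2"])
assert list(many.columns) == [
    ("segment_2", "exog_2"),
    ("segment_2", "exog_1"),
    ("segment_1", "exog_2"),
    ("segment_1", "exog_1"),
    ("segment_0", "exog_2"),
    ("segment_0", "exog_1"),
]
```

Whether callers such as SklearnTransform need the requested order instead of the order in df is a separate design decision that this sketch does not settle.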

Additional context

No response

Mr-Geekman avatar Jun 28 '22 07:06 Mr-Geekman

Make sure that you do not forget to fix these places in TSDataset: this, this and this.

alex-hse-repository avatar Jul 22 '22 05:07 alex-hse-repository

And here

alex-hse-repository avatar Jul 29 '22 06:07 alex-hse-repository

And here, here

alex-hse-repository avatar Aug 17 '22 14:08 alex-hse-repository

I'll try to explain the core of the issue. We have a wide dataframe df. We want to select a few columns: [column_1, column_2]:

res = df.loc[:, pd.IndexSlice[:, [column_1, column_2]]]

In pandas 1.1.*: we get a dataframe where, at the last level, column_1 and column_2 are ordered by their position in df. So, if column_2 comes before column_1 in df, we get them in an unexpected order: first the values from column_2 and then from column_1. The column names are ordered the same way as the values.

In pandas >= 1.2: we get the columns in the order that we gave to loc.

If we make selection like:

res = df.loc[:, pd.IndexSlice[segments, [column_1, column_2]]]

then in both cases we get the order given to loc.

Mr-Geekman avatar Aug 25 '22 09:08 Mr-Geekman

More detailed results. Imagine we have a df_wide with segments: ["segment_2", "segment_1", "segment_0"] and with features: ["target", "exog_2", "exog_1", "exog_0"].

Calling df_wide.loc[:, pd.IndexSlice[:, ["exog_1", "exog_2"]]] gives order of columns:

  1. pandas=1.1.5:
     • segment_2/exog_2
     • segment_2/exog_1
     • segment_0/exog_2
     • segment_0/exog_1
     • segment_1/exog_2
     • segment_1/exog_1
  2. pandas=1.3.5:
     • segment_2/exog_1
     • segment_0/exog_1
     • segment_1/exog_1
     • segment_2/exog_2
     • segment_0/exog_2
     • segment_1/exog_2

Calling df.loc[:, pd.IndexSlice[["segment_2", "segment_0", "segment_1"], ["exog_1", "exog_2"]]] gives order of columns:

  1. pandas=1.1.5:
     • segment_2/exog_1
     • segment_2/exog_2
     • segment_0/exog_1
     • segment_0/exog_2
     • segment_1/exog_1
     • segment_1/exog_2
  2. pandas=1.3.5:
     • segment_2/exog_1
     • segment_2/exog_2
     • segment_0/exog_1
     • segment_0/exog_2
     • segment_1/exog_1
     • segment_1/exog_2

Conclusions:

  1. If we don't give segments, we get different results on different pandas versions.
  2. If we give segments, the results are the same.
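
The comparison above can be rerun on any installed pandas version with a script along these lines. It is a sketch: the frame is rebuilt from the description (the level names "segment" and "feature" and the zero-filled values are assumptions), and `column_order` is a hypothetical helper for printing. No expected output is shown because the printed order is exactly what varies between versions.

```python
import numpy as np
import pandas as pd

segments = ["segment_2", "segment_1", "segment_0"]
features = ["target", "exog_2", "exog_1", "exog_0"]
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df_wide = pd.DataFrame(np.zeros((2, len(columns))), columns=columns)


def column_order(df):
    """Render MultiIndex columns as 'segment/feature' strings, in their current order."""
    return ["/".join(col) for col in df.columns]


# Slice without segments: the order depends on the pandas version.
without_segments = df_wide.loc[:, pd.IndexSlice[:, ["exog_1", "exog_2"]]]

# Slice with explicit segments.
with_segments = df_wide.loc[
    :, pd.IndexSlice[["segment_2", "segment_0", "segment_1"], ["exog_1", "exog_2"]]
]

print("pandas", pd.__version__)
print("without segments:", column_order(without_segments))
print("with segments:   ", column_order(with_segments))

# Sorting the columns removes the difference, whatever order loc produced.
normalized_a = without_segments.sort_index(axis=1)
normalized_b = with_segments.sort_index(axis=1)
assert list(normalized_a.columns) == list(normalized_b.columns)
```

Sorting with sort_index(axis=1) is only one way to get a version-independent order; it yields lexicographic order, which may differ from both the order in df_wide and the order passed to loc.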

Mr-Geekman avatar Aug 29 '22 16:08 Mr-Geekman