extract_features may produce duplicated columns

Open nagiton opened this issue 3 years ago • 2 comments

The problem:

When I tried to apply extract_features, it produced duplicated columns.

My script looks like this:

extract_features(timeseries, column_id="id", column_sort="time")

(Sorry, I cannot share the data I used because it is confidential.)

In my case, the output DataFrame had two columns with the same name, __value_count__value_1.

Duplicated feature columns can cause an InvalidIndexError when applying tsfresh.utilities.dataframe_functions.impute.

I think this should be fixed.
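
Until the cause is found, one possible workaround is to drop repeated column names before calling impute. This is a minimal sketch in plain pandas. The features DataFrame and its x__ column names are hypothetical stand-ins for the real extract_features output, which is not available here.

```python
import pandas as pd

# Hypothetical stand-in for the extract_features output (the real data is confidential).
features = pd.DataFrame(
    [[1.0, 2.0, 2.0], [3.0, 4.0, 4.0]],
    columns=["x__maximum", "x__value_count__value_1", "x__value_count__value_1"],
)

# Column names that occur more than once.
duplicated_names = features.columns[features.columns.duplicated()].unique().tolist()

# Keep only the first occurrence of each column name.
deduplicated = features.loc[:, ~features.columns.duplicated()]
```

This only hides the symptom: it assumes the repeated columns hold the same values, which is worth checking before dropping them.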

Anything else we need to know?:

Environment:

  • Python version: 3.7.4
  • Operating System: ubuntu 16.04.6 LTS
  • tsfresh version: 0.18.0
  • Install method (conda, pip, source): pip

nagiton avatar May 12 '21 07:05 nagiton

Hi @nagiton, thanks for your bug report. Unfortunately, I cannot reproduce the issue. I tried with the example data:

from tsfresh import extract_features
from tsfresh.examples import load_robot_execution_failures

df, _ = load_robot_execution_failures()

df_extracted = extract_features(df, column_id="id", column_sort="time")

# Check for duplicated columns
assert len(list(df_extracted.columns)) == len(set(df_extracted.columns))

I assume that the bug is independent of the actual data, so maybe you could try to reproduce it with a small amount of test data? Make sure to use the same column names as in your original data.

nils-braun avatar May 14 '21 15:05 nils-braun

Extracting features from a df returns a feature_df in which some columns have different names but identical values, e.g. 'temp__maximum' and 'temp__absolute__maximum'. Is it OK to drop these duplicate columns?
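
For what it's worth, columns like these are different features that happen to coincide on some data (a maximum and an absolute maximum are equal whenever the largest value is also the largest in magnitude), so whether they are redundant depends on the dataset. A minimal sketch for finding such value-identical columns in plain pandas, using a hypothetical feature table whose names follow the example above:

```python
import pandas as pd

# Hypothetical feature table; the values are made up for illustration.
features = pd.DataFrame({
    "temp__maximum": [3.0, 5.0],
    "temp__absolute__maximum": [3.0, 5.0],
    "temp__minimum": [1.0, 2.0],
})

# Transpose so columns become rows, then flag rows that repeat an earlier row.
is_repeat = features.T.duplicated()
redundant = is_repeat[is_repeat].index.tolist()

# Drop the later copy of each group of value-identical columns.
reduced = features.drop(columns=redundant)
```

Dropping them is safe only for the dataset they were checked on. New data with negative values could make the two columns differ again.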

Chima-21 avatar Jun 20 '22 16:06 Chima-21