
Problem with the function extract_features

Open AlessioBolpagni98 opened this issue 1 year ago • 7 comments

The problem: I have a script that runs every day, and in this script I use the tsfresh function extract_features(). Sometimes the script gets stuck inside the function, with the progress bar frozen at a certain percentage. The function doesn't raise any exception and the code stays blocked.
Packages (1).txt

  • Python version: 3.10.12
  • Operating System: Linux Ubuntu 22.04.3
  • tsfresh version: 0.20.1
  • Install method (conda, pip, source): pip

AlessioBolpagni98 avatar Dec 28 '23 12:12 AlessioBolpagni98

Hi @AlessioBolpagni98 ! Is this deterministic - meaning: it always gets stuck with the same data? Do you see a certain pattern in which data it gets stuck? And which feature calculators are you using?

nils-braun avatar Jan 03 '24 08:01 nils-braun

I encountered a similar issue. My raw dataframe has 1k ids, 27k rows, and 140 features. Full feature extraction completes fine with MultiprocessingDistributor(n_workers=12) on a 64GB machine within 30 mins, but it always hangs with ClusterDaskDistributor on 4 nodes of 64GB workers, and I noticed that it hangs in the result-gathering step. After about 4 hours, the extract_features job gets killed for running out of memory. My environment is: python 3.10.12, tsfresh 0.20.2, dask 2024.7.0, pandas 2.2.2, OS: Ubuntu 22.04.1 LTS (Jammy Jellyfish).

sidneyzhu avatar Jul 16 '24 13:07 sidneyzhu

@AlessioBolpagni98 have you fixed this issue?

sidneyzhu avatar Jul 17 '24 13:07 sidneyzhu

My problem was that I was using the extract_features() function improperly: I was passing the same column for both the 'column_id' and 'column_sort' parameters.

This was my problematic function:

def get_features(df_BTC):
    """Extract features using tsfresh; return a dataframe with the features."""
    df_BTC = df_BTC.reset_index(drop=False)
    # Extract the features
    # NOTE: column_id and column_sort point to the same column -- this was the bug
    params = {
        "timeseries_container": df_BTC,
        "column_sort": "Date",
        "column_id": "Date",
    }

    extracted_features = extract_features(**params)
    impute(extracted_features)  # inplace

    cols_zero = []  # drop features with zero standard deviation
    for col in extracted_features.columns:
        if extracted_features[col].std() == 0:
            cols_zero.append(col)
    extracted_features_pulito = extracted_features.drop(columns=cols_zero)
    extracted_features_pulito["Date"] = df_BTC["Date"]

    return extracted_features_pulito

AlessioBolpagni98 avatar Jul 17 '24 17:07 AlessioBolpagni98

To solve this, in my case all the rows must share the same id, so I created an id 'A' for all the rows.
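A minimal sketch of that fix (the column names and sample data here are hypothetical, not from the original script): give every row a constant id column distinct from the sort column, then pass that to extract_features.

```python
import pandas as pd

# Hypothetical daily price data, standing in for df_BTC above.
df = pd.DataFrame(
    {
        "Date": pd.date_range("2024-01-01", periods=5, freq="D"),
        "Close": [42000.0, 42100.0, 41900.0, 42300.0, 42500.0],
    }
)

# Constant id: tsfresh then treats the whole frame as one time series,
# sorted by Date. The id column must be different from column_sort.
df["series_id"] = "A"

# from tsfresh import extract_features
# features = extract_features(df, column_id="series_id", column_sort="Date")
```

With a single id, extract_features returns one row of features describing the entire series.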

AlessioBolpagni98 avatar Jul 17 '24 17:07 AlessioBolpagni98

Thanks for your reply. In my case, the code finishes with multiprocessing (n_jobs=8) in about 30 mins, but it can't finish with 8 clustered workers on different machines.

sidneyzhu avatar Jul 18 '24 02:07 sidneyzhu

I fixed it by switching to dask_feature_extraction_on_chunk(); the ClusterDaskDistributor still failed with a lot of communication errors.
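For reference, a rough sketch of that route, assuming the long/stacked input format that tsfresh's dask binding expects (one row per id/kind/time/value); the column names and cluster setup here are illustrative, not taken from this thread:

```python
import pandas as pd

# Hypothetical wide frame: one row per (id, time step), one column per sensor.
wide = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "time": [0, 1, 0, 1],
        "x": [0.1, 0.2, 0.3, 0.4],
        "y": [1.0, 1.1, 1.2, 1.3],
    }
)

# Stack into long format: one row per (id, time, kind, value).
long = wide.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

# Sketch of the dask path (assumes a running cluster; npartitions is illustrative):
# import dask.dataframe as dd
# from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
# ddf = dd.from_pandas(long, npartitions=8)
# features = dask_feature_extraction_on_chunk(
#     ddf.groupby(["id", "kind"]),
#     column_id="id", column_kind="kind",
#     column_sort="time", column_value="value",
# )
# result = features.compute()
```

Because the chunked binding computes features per (id, kind) group inside each partition, the workers never have to gather one large result on a single node, which is likely why it avoids the hang seen with ClusterDaskDistributor.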

sidneyzhu avatar Jul 18 '24 11:07 sidneyzhu