
How can I reduce memory usage when fitting auto_arima?

Open eromoe opened this issue 5 years ago • 4 comments

Hello,

I am trying to use auto_arima to search for the best model over 20 thousand time series (no exogenous variables). This is quite a bit smaller than another dataset (nearly a million series) that I trained with fbprophet (with some exogenous variables).

But to my surprise, Spark throws a memory error, even though I have given each executor core over 2 GB of memory. The training code is very simple:

        import pmdarima as pm
        from pmdarima.pipeline import Pipeline
        from pmdarima.preprocessing import BoxCoxEndogTransformer

        pipeline = Pipeline([
            ("boxcox", BoxCoxEndogTransformer()),
            ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12, start_P=0, seasonal=True, d=1, D=1, trace=True, error_action='ignore', stepwise=True, suppress_warnings=True))
        ])
        # +1 keeps the series strictly positive for the Box-Cox transform
        pipeline.fit(X['y'].to_numpy() + 1)

Is there some setting I missed that can reduce the memory usage during training?

eromoe avatar Jun 17 '20 09:06 eromoe

When you say 20 thousand time series, do you mean 20k samples? Can you please provide a bit more information, like how you're triggering these model fits on Spark executors, and what the stacktrace looks like?

tgsmith61591 avatar Jun 18 '20 12:06 tgsmith61591

They are real data; series lengths vary from 1 to 400 (most are 400).

Main training steps, for example:


import dill
import pmdarima as pm
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer


def train_model(d):
    X = d['data']

    pipeline = Pipeline([
        ("boxcox", BoxCoxEndogTransformer()),
        ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12, start_P=0, seasonal=True, d=1, D=1, trace=True, error_action='ignore', stepwise=True, suppress_warnings=True))
    ])
    # +1 keeps the series strictly positive for the Box-Cox transform
    pipeline.fit(X['y'].to_numpy() + 1)

    d['model'] = pipeline
    return d


def pickle_model(d):
    # Serialize the fitted pipeline with dill so it can be stored as binary
    model = d['model']
    d['model_pickled'] = bytearray(dill.dumps(model))

    return {
        "store_id": d['store_id'],
        "product_id": d['product_id'],
        "model_pickled": d['model_pickled'],
        "train_days": d.get('train_days'),
    }

RDD map:

from pyspark.sql.types import StructType, StructField, StringType, BinaryType

# load_data and transform_data are helper functions elided here
df = load_data(spark, ...)

df1 = (df.rdd
      .map(lambda r: r.asDict())
      .map(lambda d: transform_data(d))
      .filter(lambda d: len(d['data']) > min_train_length)
      .map(lambda d: train_model(d))
      .map(lambda d: pickle_model(d))
)

# note: 'train_days' is not in the schema, so it is dropped on createDataFrame
schema = StructType([
    *[StructField(c, StringType(), True) for c in group_cols],
    StructField('model_pickled', BinaryType(), True),
])
df2 = spark.createDataFrame(df1, schema)

df2.write.parquet(output_path, mode='overwrite')
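For completeness, a minimal sketch of restoring a model from the parquet output, assuming the output_path and schema above (the n_periods value is just an example):

import dill

# Read the serialized models back and restore one pipeline on the driver
models = spark.read.parquet(output_path)
row = models.first()
pipeline = dill.loads(bytes(row['model_pickled']))

# Forecast the next 30 periods with the restored pipeline
preds = pipeline.predict(n_periods=30)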

In fbprophet, the model carries some redundant attributes such as model.history; setting model.history = -1 reduces the stored size considerably. So I wonder if there is something similar in pmdarima.
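I'm not sure pmdarima has a direct equivalent, but the statsmodels results object it wraps has a remove_data() method that drops the stored data arrays. A minimal sketch, assuming the selected model is exposed as model_ on AutoARIMA and the statsmodels results wrapper as arima_res_ (untested; remove_data() may break later calls that need the training data):

import pickle
import pmdarima as pm

y = pm.datasets.load_wineind()
m = pm.AutoARIMA(seasonal=True, m=12, suppress_warnings=True, error_action='ignore').fit(y)

before = len(pickle.dumps(m))

# AutoARIMA keeps the chosen ARIMA in `model_`; its `arima_res_` is the
# statsmodels results wrapper, whose remove_data() drops stored arrays
# (endog/exog copies, residual arrays, etc.)
m.model_.arima_res_.remove_data()

after = len(pickle.dumps(m))
print(f"{before / 1024**2:.1f} MB -> {after / 1024**2:.1f} MB")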

eromoe avatar Jun 19 '20 02:06 eromoe

I think some of the recent changes in #359 and #361 might help with this. Hoping to get v1.7.0 out in the near future.

tgsmith61591 avatar Jul 15 '20 12:07 tgsmith61591

n_samples: 933

>>> ms = pickle.dumps(m)
>>> len(ms)/1024**2
96.15693759918213

Just for the record; I haven't tested the new version yet.
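To see where the bulk of those ~96 MB lives, a rough diagnostic sketch (assuming m is a fitted pm.AutoARIMA as above; the attribute names depend on the pmdarima/statsmodels versions):

import pickle

def attr_sizes(obj, min_mb=1.0):
    # Print the pickled size of each attribute above min_mb megabytes
    for name, value in vars(obj).items():
        try:
            mb = len(pickle.dumps(value)) / 1024 ** 2
        except Exception:
            continue  # skip unpicklable attributes
        if mb >= min_mb:
            print(f"{name}: {mb:.1f} MB")

attr_sizes(m)
attr_sizes(m.model_)             # the selected ARIMA
attr_sizes(m.model_.arima_res_)  # the statsmodels results wrapper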

eromoe avatar Jul 24 '20 10:07 eromoe