How can I reduce memory usage when fitting auto_arima?
Hello,
I am trying to use auto_arima to search for the best model across 20 thousand time series (no exogenous variables). This is quite a bit smaller than another dataset (nearly a million series) that I trained with fbprophet (with some exogenous variables).
But to my surprise, Spark throws a memory error, even though I gave each executor core over 2GB of memory. The training code is very simple:
import pmdarima as pm
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer

pipeline = Pipeline([
    ("boxcox", BoxCoxEndogTransformer()),
    ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12, start_P=0, seasonal=True, d=1, D=1, trace=True, error_action='ignore', stepwise=True, suppress_warnings=True)),
])
# +1 keeps the series strictly positive for the Box-Cox transform
pipeline.fit(X['y'].to_numpy() + 1)
Is there some setting I missed that can reduce memory usage during training?
When you say 20 thousand time series, do you mean 20k samples? Can you please provide a bit more information, like how you're triggering these model fits on Spark executors, and what the stacktrace looks like?
They are real data; the lengths vary from 1 to 400 (most are 400).
Here is an example of the main training steps:
def train_model(d):
    X = d['data']
    pipeline = Pipeline([
        ("boxcox", BoxCoxEndogTransformer()),
        ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12, start_P=0, seasonal=True, d=1, D=1, trace=True, error_action='ignore', stepwise=True, suppress_warnings=True)),
    ])
    pipeline.fit(X['y'].to_numpy() + 1)
    d['model'] = pipeline
    return d
import dill

def pickle_model(d):
    model = d['model']
    d['model_pickled'] = bytearray(dill.dumps(model))
    return {
        "store_id": d['store_id'],
        "product_id": d['product_id'],
        "model_pickled": d['model_pickled'],
        "train_days": d.get('train_days'),
    }
And the RDD map:
from pyspark.sql.types import StructType, StructField, StringType, BinaryType

df = load_data(spark, ...)
df1 = (df.rdd
       .map(lambda r: r.asDict())
       .map(lambda d: transform_data(d))
       .filter(lambda d: len(d['data']) > min_train_length)
       .map(lambda d: train_model(d))
       .map(lambda d: pickle_model(d)))
schema = StructType([*[StructField(i, StringType(), True) for i in group_cols],
                     StructField('model_pickled', BinaryType(), True)])
df2 = spark.createDataFrame(df1, schema)
df2.write.parquet(output_path, mode='overwrite')
fbprophet has some redundant attributes, e.g. model.history; setting model.history = -1 before saving reduces storage a lot. So I wonder if there is something similar in pmdarima.
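In code, that fbprophet trick is roughly the following (a sketch; model is assumed to be a fitted Prophet instance, and -1 is just a cheap sentinel replacing the stored training frame, as described above):

import pickle

# Prophet keeps a copy of the training DataFrame on model.history;
# swapping it for a sentinel before pickling shrinks the payload.
model.history = -1
small = pickle.dumps(model)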
I think some of the recent changes in #359 and #361 might help with this. Hoping to get v1.7.0 out in the near future.
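In the meantime, one possible workaround sketch: statsmodels results objects expose remove_data(), which clears the cached data arrays before pickling. This assumes the fitted AutoARIMA keeps its best model on model_ and the statsmodels results on arima_res_ (internal attributes that may change between versions), and that the Pipeline stores its steps like scikit-learn's:

# Sketch only, not an official pmdarima API.
auto = pipeline.steps[-1][1]            # the fitted AutoARIMA step
auto.model_.arima_res_.remove_data()    # statsmodels: drop stored data arrays
smaller = bytearray(dill.dumps(pipeline))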
n_samples: 933
>>> ms = pickle.dumps(m)
>>> len(ms) / 1024 ** 2  # pickled size in MB
96.15693759918213
Just for the record; I haven't tested the new version yet.