Suggestion: Implement a `.remove_data` function for Results
Description
Fitted results from linearmodels can be pickled with `pickle.dump`. These pickled files contain the estimated parameters alongside all the data required to estimate them. Saving that data is generally (always?) not desired, as keeping it in the results substantially increases the size of the pickled files. Once estimated, the parameters no longer require these potentially large datasets to be displayed or processed.
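A minimal illustration of the problem, using a stand-in class rather than linearmodels itself: a results object that keeps a reference to its estimation data pickles far larger than one that stores only the estimates.

```python
import pickle


class FakeResults:
    """Stand-in for a fitted results object (hypothetical, for illustration)."""

    def __init__(self, params, data):
        self.params = params  # small: the estimates we actually care about
        self.data = data      # large: the dataset used during estimation


data = list(range(1_000_000))  # stand-in for a large panel dataset
res = FakeResults(params=[0.5, 1.2], data=data)

full = len(pickle.dumps(res))
res.data = None                # drop the data, keep the parameters
small = len(pickle.dumps(res))
print(full, small)             # the data-free pickle is orders of magnitude smaller
```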
Example
My use case is as follows, with a large (N = 500'000, T = 123) panel dataset.
- Create a list of all desired model specifications and comparisons
- Estimate all the different models
- Save different comparisons of these results with `compare`
In pseudocode:

```python
specifications = pd.DataFrame({"formulas": formulas, "criterium": criteria})

results = []
for formula in specifications["formulas"]:
    model = PanelOLS.from_formula(formula, data)
    res = model.fit()
    results.append(res)
specifications["results"] = results

for criterium in specifications["criterium"].unique():
    results = specifications.query("criterium == @criterium")["results"]
    comparison = compare(results)
    comparison.summary.as_latex()
```
As my dataset is very large, pickling `results` or the DataFrame `specifications` takes up multiple GBs to store just a small number of estimated parameters. Ideally, I would be able to store/pickle only the results. That way, I can separate estimating the models from comparing them. For example, this would allow someone to run the estimations overnight and kill the process once done.
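The desired two-stage workflow, sketched with plain dicts as stand-ins for fitted results (with linearmodels today, the full dataset would ride along in the pickle, which is exactly the problem):

```python
import os
import pickle
import tempfile

# Stage 1 (e.g. run overnight): estimate models, then persist only the results.
results = {"spec_a": {"params": [0.5, 1.2]}, "spec_b": {"params": [0.7]}}
path = os.path.join(tempfile.mkdtemp(), "results.pkl")
with open(path, "wb") as fh:
    pickle.dump(results, fh)

# Stage 2 (later, possibly in a fresh process): reload and compare.
with open(path, "rb") as fh:
    reloaded = pickle.load(fh)
print(reloaded["spec_a"]["params"])  # [0.5, 1.2]
```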
Workaround
I created this hacky workaround to remove many attributes from the model and results objects that aren't required if you're only interested in storing the results. With this, I can reduce the size of the pickled objects from ~50 GB to around 250 MB.
```python
import functools


def fake_cov(_deferred_cov, *args, **kwargs):
    return _deferred_cov


def shrink_mod_and_res(mod, res):
    """
    Remove any DataFrames and large objects that are unnecessarily stored in
    the model and results objects.
    """
    mod.dependent._frame = mod.dependent._frame.head(1)
    mod.dependent._original = None
    mod.dependent._panel = None
    mod.exog._frame = mod.exog._frame.head(1)
    mod.exog._original = None
    mod.exog._panel = None
    mod.weights._frame = mod.weights._frame.head(1)
    mod.weights._original = None
    mod.weights._panel = None
    mod._cov_estimators = None
    mod._x = None
    mod._y = None
    mod._w = None
    mod._not_null = None
    mod._original_index = None
    res._resids = None
    res._wresids = None
    res._original_index = None
    res._effects = None
    res._index = None
    res._fitted = None
    res._idiosyncratic = None
    res._not_null = None
    # Evaluate the deferred covariance once, then freeze the result so the
    # data it would otherwise need can be dropped.
    _deferred_cov = res._deferred_cov()
    res._deferred_cov = functools.partial(fake_cov, _deferred_cov=_deferred_cov)
    return mod, res


model = PanelOLS(y, x)
res = model.fit()
model, res = shrink_mod_and_res(model, res)
```
It's not clear to me why the calculation of the covariance is deferred. I suppose that if you want to change the covariance estimator after estimation, this hacky method would need to store all possible covariance estimates.
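The `functools.partial` trick above, isolated into a self-contained sketch: a deferred (lazy) computation is evaluated once, and the callable is then replaced by a partial that just returns the precomputed value, so the inputs the deferred callable kept alive can be freed.

```python
import functools


def deferred_cov(data):
    # Stand-in for an expensive covariance computation over the full dataset.
    return sum(data) / len(data)


data = list(range(10))
lazy = functools.partial(deferred_cov, data)  # keeps `data` alive


def fake_cov(_value, *args, **kwargs):
    return _value


precomputed = lazy()                           # evaluate once
frozen = functools.partial(fake_cov, precomputed)
del data, lazy                                 # the large inputs can now be freed
print(frozen())                                # same result, no data retained
```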
Suggestion
Implement a (cleaner) method to remove the large datasets contained in the Results, similar to the `remove_data` flag in the `.save()` method of statsmodels' models.
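A hedged sketch of what such a method could look like; the class and attribute names below are illustrative, not linearmodels' real internals.

```python
class Results:
    """Hypothetical results object with data-sized attributes."""

    # Illustrative list of attributes that scale with the dataset.
    _data_attrs = ("_resids", "_fitted")

    def __init__(self, params, resids, fitted):
        self.params = params
        self._resids = resids
        self._fitted = fitted

    def remove_data(self):
        """Null out data-sized arrays, keeping only the estimates."""
        for name in self._data_attrs:
            setattr(self, name, None)


res = Results(params=[0.5, 1.2], resids=[0.0] * 1000, fitted=[1.0] * 1000)
res.remove_data()
print(res.params, res._resids)  # parameters survive; data arrays are gone
```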