Reduce number of 3rd party packages required for a prediction-only setup
My use case is that I'm running a trained causalml model on a server. I'm done with analysis, hyperopt, visualization, ... all of that isn't necessary any more. So I pickled my model and moved it to a designated production environment, which I configured so that it can unpickle the model and run predictions on it.
But the way causalml is set up, many of those "non-core" packages that deal with training and analysis are still hard runtime dependencies, even if I were to install causalml with `--no-deps` (as suggested in https://github.com/uber/causalml/pull/250#issuecomment-729884475, which I'd really like to avoid). Just to show an example: the model I'm using is `causalml.inference.tree.causal.causalforest.CausalRandomForestRegressor`, and in `causalml.inference.tree.__init__.py` all of the local modules are imported as well (e.g. `causalml.inference.tree.plot`), which leads to a number of the 3rd-party imports that I have an issue with, like `seaborn`, `matplotlib`, `pydotplus`, ...
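To spell the chain out, this is roughly what happens at import time (the internal import statements are paraphrased from memory, so treat them as approximate):

```python
# Approximate illustration of the import chain -- the internal imports are
# paraphrased, not copied verbatim from the causalml source.
from causalml.inference.tree import CausalRandomForestRegressor
# executing causalml/inference/tree/__init__.py eagerly imports the sibling
# modules, e.g. `from . import plot` ...
# ... and causalml/inference/tree/plot.py in turn does `import seaborn`,
# `import matplotlib.pyplot`, `import pydotplus`, ...
# so the single import above raises ImportError on a machine that only has the
# numerical prediction stack installed, even though predict() never uses them.
```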
Would it be possible to separate every dependency that isn't necessary to run predictions into extras? Or at least, to restructure the code so that a manual install of the actual runtime dependencies doesn't lead to unrelated 3rd-party package imports? I realize this is a massive ask, but it's a serious problem for me that I can't solve without forking your project and running my own builds (which I'd really, really like to avoid).
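To make the extras idea concrete, here's a rough sketch of what the packaging side could look like -- the core list is just what I found sufficient for `CausalRandomForestRegressor` predictions, and the extras names and split are my suggestion, not how causalml is packaged today:

```python
# Hypothetical setup.py fragment. The install_requires list is what worked for
# my prediction-only image; the extras groups are only a suggestion.
from setuptools import setup

setup(
    name="causalml",
    install_requires=[
        # minimal stack needed to unpickle a tree-based model and call predict()
        "numpy",
        "scikit-learn<=1.0.2",
        "packaging",
        "forestci",
        "tqdm",
        "pathos",
    ],
    extras_require={
        "plot": ["matplotlib", "seaborn", "pydotplus"],
        "torch": ["torch"],
        "all": ["matplotlib", "seaborn", "pydotplus", "torch"],
    },
)
```

A prediction-only server could then just `pip install causalml`, while analysis environments would use `pip install "causalml[all]"`.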
Just to give an idea of why it's an issue:
base/Dockerfile
# need a builder since no wheels are released to PyPI, except for a single python3.8 mac build?
FROM python:3.10-slim as builder
RUN apt-get update && \
apt-get -y install build-essential
RUN pip install "setuptools>=18.0" wheel cython numpy "scikit-learn<=1.0.2"
RUN pip install causalml --no-deps
RUN pip wheel -w wheels causalml --no-deps
FROM python:3.10-slim
COPY --from=builder wheels wheels
RUN pip install "scikit-learn<=1.0.2" packaging forestci tqdm pathos && \
pip install wheels/causalml* --no-deps
This image contains the core set of 3rd-party packages necessary to predict with a `CausalRandomForestRegressor`. I didn't investigate what other models would need, but numerical computation libraries don't have a massive disk footprint anyway -- the whole image is 507 MB, which is reasonable for a simple ML backend.
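For context, the only thing this image needs to do is roughly the following (the pickle path and feature matrix are placeholders for my actual setup, and I'm assuming the usual sklearn-style `predict(X)` signature):

```python
# Prediction-only workflow on the server. The pickle file and the feature
# matrix are placeholders; the model was trained and pickled elsewhere.
import pickle

import numpy as np

with open("causal_forest.pkl", "rb") as fh:
    model = pickle.load(fh)  # a fitted CausalRandomForestRegressor

X = np.random.rand(10, 8)   # stand-in for real feature rows
effects = model.predict(X)  # estimated treatment effects -- no plotting needed
print(effects.shape)
```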
actual/Dockerfile
FROM python:3.10-slim as builder
RUN apt-get update && \
apt-get -y install build-essential
RUN pip install "setuptools>=18.0" wheel cython numpy "scikit-learn<=1.0.2"
RUN pip install causalml --no-deps
RUN pip wheel -w wheels causalml --no-deps
FROM python:3.10-slim
COPY --from=builder wheels wheels
RUN pip install wheels/causalml*
This is the whole package, and visualization libs do tend to eat up a fair share of disk space. Plus torch. The image clocks in at 6.54 GB, so a difference of ~6 GB that I simply don't need.
My CI/CD straight up refuses to run this build for me because it doesn't support artifacts of this size. I didn't even know that could happen.
I couldn't find similar issues in the tracker, apologies if I just missed them. If I didn't, I'd be surprised though -- am I actually the first user who has this issue? Is dockerizing / running causalml on a server a strange thing to do?
Regarding PRs, I might be able to write one, but I wouldn't start unless the issue itself is green-flagged by the maintainers.
Thanks for submitting this, @a-recknagel. Addressing this will help many others who'd like to deploy the causalml models. Can you take a stab at it?
A couple of things I can think of are:
- Removing `plot.*` from all `__init__.py` files
- Making `pytorch` an optional dependency, similar to #343 (a sketch of the import-guard pattern follows below)
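For the optional pytorch dependency, something along these lines could work -- the module name, helper name, and error message below are only illustrative, not existing causalml code:

```python
# Hypothetical helper module, e.g. causalml/_torch_import.py -- names and
# wording are placeholders to illustrate the optional-dependency pattern.
try:
    import torch  # noqa: F401
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False


def require_torch(feature):
    """Fail with an actionable message when a torch-backed feature is used."""
    if not HAS_TORCH:
        raise ImportError(
            f"{feature} requires PyTorch, which is an optional dependency. "
            "Install it with `pip install torch` (or via a causalml extra, "
            "once one exists)."
        )
```

The torch-backed estimators would then call `require_torch(...)` in their constructors instead of importing torch at package import time.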
Ok, that's good to know -- I'd love to try. I hope to keep the changes to those two areas, changing import paths and writing extras groups, but I'd consider either of them a breaking change. Not that that'll stop me, and the project is still in ZeroVer so it won't matter much, but I guess I want to ask how careful I should be. Should I read up on custom importer overloads to try to keep existing import paths working, or would that be a wasted effort?
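Concretely, the alternative to a full custom importer that I had in mind is a module-level `__getattr__` (PEP 562) in the package `__init__`, so the old import paths keep working without eagerly pulling in the plotting stack. A sketch, assuming the plotting module stays where it is and just stops being imported eagerly:

```python
# Sketch of lazy re-exports in causalml/inference/tree/__init__.py (PEP 562).
# This illustrates the idea; it is not a patch against the current file.
from importlib import import_module

# cheap, prediction-only imports stay eager
from .causal.causalforest import CausalRandomForestRegressor  # noqa: F401

_LAZY_SUBMODULES = {"plot"}


def __getattr__(name):
    # Only called when normal attribute lookup fails, so seaborn/matplotlib
    # get imported on first access to `causalml.inference.tree.plot`, not at
    # package import time.
    if name in _LAZY_SUBMODULES:
        return import_module(f".{name}", __name__)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```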
Also, I'll probably touch most files in the project due to moving folders. Are there any particular WIPs or branches that I should consider or wait for before starting? The merge conflicts would be spectacularly bad.