Support optional dependencies for CUDA related packages (torch, etc.)
Problem Description
"pip install sdv" has a lot of dependencies related to CUDA and Torch that should be optional (like "pip install sdv[no-nvidia]" for example). This would make it a lot easier for organizations to get the Community version running in their own VPCs/images.
Expected behavior
"pip install sdv[no-nvidia]" should download a version of SDV which is CPU only and which does not use CUDA or require all of the NVIDIA libraries.
Additional context
This will make it easier for developers inside companies to demonstrate the value of SDV to their managers.
Hi there @matanitah, unfortunately pip doesn't have a way to explicitly exclude specific dependencies (only a way to explicitly include them). This means we'd have to slim down the dependencies in base SDV (the most common starting point and install path) quite a bit and add CUDA-based packages as a set of optional dependencies. This approach has its own tradeoffs as well!
To help us better understand, do you mind sharing more context on the barriers you're encountering when trying to "get the Community version running in [your] own VPCs/images" ?
Sure! The NVIDIA libraries increase the size of the image we have to run quite dramatically, and they're really only relevant if we decide to run our SDV code on GPUs. In our case we have opted to go with CPU anyway. I think having an option like "pip install sdv[gpu]" would be beneficial because it would keep the image needed to run SDV Community code light for those who only want to use CPU, which makes EC2 load times faster and makes it easier for us to keep the cost of infrastructure low.
Hi @matanitah It’s great to see your interest in the SDV ecosystem. This comment is a reminder to consult your legal team before adopting the SDV into your project, as SDV has a source-available license.
For more information, you can read through our license FAQs (not legal advice). For any other questions, you can Contact Us. You can also inquire about a commercial license to allow additional use.
That makes sense @matanitah thanks for sharing more context! I'll leave this issue open as a feature request for the team :)
This is similar to this other feature request as well: https://github.com/sdv-dev/SDV/issues/1621
Seconding this issue - makes it very cumbersome to deploy SDV, even if only a reduced feature set is used.
Thanks @spreeni. We are still evaluating within the team. Python unfortunately does not allow you to remove dependencies, or else we would love to do something like pip install sdv[no-cuda]. We are still evaluating the pros/cons of changing the default dependencies listed for pip install sdv, which have been established for many years now.
In the meantime, would you be able to describe more about how it's affecting your deployment? For example, is it increasing installation time, using up more memory, or is it something else? Any details you can share about where you're deploying, how often you are installing, etc. will be very helpful for us to make the case. Thank you.
Hey @npatki, I have used SDV within a Docker container that I push to a GitLab container registry, from where it is then pulled by other deployment services. For me the issues are the following:
- package size - it takes very long to deploy this to the registry (although less so on subsequent pushes due to delta updates), and I am not sure how lazily the imports are handled, but it could also increase memory demand running in any deployment
- installation time - it takes quite a while to install updates if changes in the library occur. This also makes it less practical for quick demos and tests, where you may not need the accuracy of sophisticated deep learning models
- dependency bloat - it is just not nice to carry a lot of dependencies with you that you don't use (e.g. GPU-enabled torch, plotly, boto3). Here, an opt-in process would be nice.
In the following minimal example, the library adds 6GB of dependencies to my Docker image.
FROM --platform=linux/x86_64 python:3.12-slim-bookworm
RUN python -m venv /venv
ENV PATH="/venv/bin:$PATH"
RUN pip install --no-cache-dir --default-timeout=300 sdv
Hello @matanitah and @spreeni, are you still working with SDV?
The good news is that starting from today's release (SDV v1.23.0), you should be able to import SDV even if you don't have torch installed. You should also be able to use any SDV synthesizer that does not require torch (e.g. GaussianCopulaSynthesizer).
# GaussianCopulaSynthesizer works
>>> from sdv.single_table import GaussianCopulaSynthesizer
>>> synthesizer = GaussianCopulaSynthesizer(metadata)
>>> synthesizer.fit(data)
>>> synthetic_data = synthesizer.sample(num_rows=100)
# Other synthesizers that require torch will not work
>>> from sdv.single_table import CTGANSynthesizer
>>> synthesizer = CTGANSynthesizer(metadata)
ModuleNotFoundError: No module named 'torch'. Please install torch in order to use the 'CTGANSynthesizer'.
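If the same code needs to run in environments with and without torch, one way to avoid hitting the error above is to check for torch before choosing a synthesizer. This is only a sketch: `module_available` and `pick_synthesizer_name` are hypothetical helper names, not part of SDV, and the only SDV names assumed are the two synthesizer classes shown above.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_synthesizer_name(prefer_deep_learning=True):
    """Pick a synthesizer class name based on whether torch is installed.

    Hypothetical helper: falls back to GaussianCopulaSynthesizer when torch
    is missing or when deep learning models are not wanted.
    """
    if prefer_deep_learning and module_available("torch"):
        return "CTGANSynthesizer"
    return "GaussianCopulaSynthesizer"


# Usage sketch (assumes sdv is installed and `metadata`/`data` are defined):
#   import sdv.single_table
#   synthesizer_class = getattr(sdv.single_table, pick_synthesizer_name())
#   synthesizer = synthesizer_class(metadata)
#   synthesizer.fit(data)
```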
Note that SDV still lists torch as a dependency because we do believe that CTGAN, TVAE, etc. are important components of the SDV package. But when setting up your environment, you can bypass this if you need to by:
- Installing SDV, and then uninstalling torch, OR
- Installing packages only from a pre-defined requirements.txt. I tested this out on this requirements.txt.
pip install sdv --no-deps --requirement requirements.txt
In either option, you will end up with SDV installed without torch, which should hopefully unblock you from your project.
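To confirm that an environment ended up in that state, a quick check like the following may help. It is a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `describe_environment` are hypothetical, and the distribution names `sdv` and `torch` are assumed to match what pip installs.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of `dist_name`, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(names=("sdv", "torch")):
    """Map each distribution name to its installed version (None if absent)."""
    return {name: installed_version(name) for name in names}


# Usage sketch: after either option, "sdv" should map to a version string
# and "torch" should map to None.
#   for name, version in describe_environment().items():
#       print(f"{name}: {version or 'not installed'}")
```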
Let us know if this doesn't work or if you have any other feedback around this. Thanks.
I'm closing off this issue since we are now supporting the ability to use SDV features without having CUDA or torch installed -- #2551.
We have made the decision to still list torch as a dependency of SDV, since CTGAN, TVAE, etc. are important, much-used features in the SDV package. However, you should still be able to set up your environment following the code above.
Please feel free to reply if there is more to discuss -- as I can always re-open the issue for further investigation. Or alternatively, file a new issue for new requests/questions. Thanks all!