aws-step-functions-data-science-sdk-python
Make sagemaker an optional dependency
I'd like to see sagemaker marked as an optional dependency, as it is not always required.
Use Case
Not all users of this package are using the sagemaker dependency, and it adds a lot of bloat to overall installations by forcing users to have gcc and g++ to be able to install numpy and pandas.
Proposed Solution
This is quite simple to do.
In `setup.py`, you'd have to change `install_requires` and add `extras_require`:
```python
setup(
    # [...]
    install_requires=[
        "boto3>=1.14.38",
        "pyyaml",
    ],
    extras_require={
        "sagemaker": ["sagemaker>=2.1.0"],
        "test": [],  # [...]
    },
    # [...]
)
```
In addition, some documentation should probably be added to the README to make it easier for users to understand how to install the extra dependency (`pip install stepfunctions[sagemaker]`).
Other
If this is implemented, the major version should probably be bumped, since it changes how the package is installed.
- [x] :wave: I may be able to implement this feature request
- [x] :warning: This feature might incur a breaking change
This is a :rocket: Feature Request
Makes sense. We received a similar request in https://github.com/aws/aws-step-functions-data-science-sdk-python/issues/116
I agree this would be a breaking change. I'm not sure how many customers are using this SDK without sagemaker. So far, I've only seen this request from 3 individuals.
I did some testing, and it looks like some of the implicit imports in `stepfunctions.steps` would also need to be conditional if this change is to be implemented, specifically the ones around sagemaker.
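A minimal sketch of what such a conditional import could look like. The flag and helper names (`HAS_SAGEMAKER`, `require_sagemaker`) are hypothetical, not the SDK's actual layout:

```python
# Sketch: guard the optional sagemaker import so the rest of the package
# still imports cleanly when the "sagemaker" extra is not installed.
try:
    import sagemaker  # optional dependency, provided by `pip install stepfunctions[sagemaker]`
    HAS_SAGEMAKER = True
except ImportError:
    sagemaker = None
    HAS_SAGEMAKER = False


def require_sagemaker(feature: str) -> None:
    """Raise a helpful error when a sagemaker-backed feature is used
    without the optional dependency installed. Hypothetical helper."""
    if not HAS_SAGEMAKER:
        raise ImportError(
            f"{feature} requires the optional 'sagemaker' dependency. "
            "Install it with: pip install stepfunctions[sagemaker]"
        )
```

The sagemaker-specific steps would then call `require_sagemaker(...)` at construction time, so only users who actually touch those steps need the heavy dependency.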
If you want me to, I can look into it more and send a PR next week!
As a side note, here is a bit more info about my use case:
We're currently using Step Functions as a deployment pipeline inside our accounts. I'm doing a rewrite to make the pipeline more powerful by using this library to dynamically create state machines based on a configuration file for each repo.
The main problem right now is that I have to build the Lambda inside a Docker container running Terraform. This container doesn't have gcc, g++, or make by default, so I have to install them at runtime to avoid rebuilding a bunch of container images. NumPy and pandas also take a long time to install 😞
Luckily, I saw the announcement a few days ago about Lambda being able to use ECR images from other accounts, so that should mostly alleviate the problem for me once I get it implemented (we're using an account per team, per environment).
Here is the current work in progress, if you're interested: nsbno/terraform-aws-delivery-pipeline.
My team would love to see this work completed and merged. We prefer writing Amazon States Language in Python as opposed to JSON, and this library gives us the functionality to do that. An implementation detail prevents us from using the CDK where the step functions are defined, so making this SDK lighter would make it easier to use.
> An implementation detail prevents us from using the CDK where the step functions are defined
Can you elaborate on what issue prevents you from using CDK? @DrewThomasCorps
For the public record: this conversation was held offline. The CDK also adds some bloat, and in our case it's preferable to write Amazon States Language with a simple library rather than requiring constructs, IDs, and stacks.
This is not a pressing concern and there are workarounds in place, so we don't need this resolved immediately, but we'll be keeping an eye on it.