[bitnami/mlflow] fix issue 26397 - allow MLFLOW_S3_ENDPOINT without minio
Description of the change
Changes the condition to check whether either MinIO or externalS3 is enabled.
- [x] Double check if we shouldn't use the condition in _helpers.tpl
[edit] moved setting the variable to inside the already existing check if s3 is enabled. This simplifies it.
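For illustration, a hedged sketch of the kind of template condition this change describes (not the actual chart diff; the `externalS3` value names are taken from this chart's values, but the exact template structure is an assumption):

```yaml
# Hypothetical deployment template fragment: set the endpoint when either
# the bundled MinIO subchart or an external S3 host is configured.
{{- if or .Values.minio.enabled .Values.externalS3.host }}
- name: MLFLOW_S3_ENDPOINT_URL
  value: {{ printf "%s://%s:%v" .Values.externalS3.protocol .Values.externalS3.host .Values.externalS3.port | quote }}
{{- end }}
```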
Benefits
This fixes a bug in the chart where an external S3 endpoint cannot be set unless MinIO is also enabled, which is undesirable.
Possible drawbacks
None that I am aware of.
Applicable issues
- fixes #26397
Additional information
Checklist
- [x] Chart version bumped in Chart.yaml according to semver. This is not necessary when the changes only affect README.md files.
- [x] Variables are documented in the values.yaml and added to the README.md using readme-generator-for-helm
- [x] Title of the pull request follows this pattern: [bitnami/<name_of_the_chart>] Descriptive title
- [x] All commits signed off and in agreement of Developer Certificate of Origin (DCO)
Hi @dhrp, thank you very much for contributing to this Helm chart.
This PR is reverting the changes introduced in #25294, sent in an attempt to fix #23959. Wouldn't merging your proposed changes cause regressions for the users affected by #23959?
In that issue, we came to the conclusion that MLFLOW_S3_ENDPOINT_URL was only required when MinIO was enabled (maybe that conclusion is wrong?) and users who need to set the env var in other scenarios would always be able to set it via extraEnvVars like this:
$ helm template oci://registry-1.docker.io/bitnamicharts/mlflow --version 1.2.2 --set minio.enabled=false --set tracking.extraEnvVars[0].name=MLFLOW_S3_ENDPOINT_URL,tracking.extraEnvVars[0].value='https://my-s3-endpoint:443' | grep -A1 MLFLOW_S3_ENDPOINT_URL
- name: MLFLOW_S3_ENDPOINT_URL
value: https://my-s3-endpoint:443
Can you share your thoughts on this? Thank you!
@frittentheke was right; I had actually fixed it correctly at first, and then simplified it incorrectly. When external S3 is enabled and MinIO is not, the host does not need to be set when you are using Amazon AWS S3.
I now resolved it by dropping my own last commit.
I left a suggestion and a concern as an inline comment. Please, check it when you have a chance! Thank you.
I replied in the thread @andresbono .
Would be nice to get the ability to use external non-AWS S3.
I've resolved the comment regarding the helper and rebased on main.
@javsalgar thanks for reopening https://github.com/bitnami/charts/issues/26397. Could you please kindly take a peek at this PR then?
Thank you, @dhrp, @frittentheke. I reopened https://github.com/bitnami/charts/pull/26462#discussion_r1650708286. Sharing my concerns there.
Thank you for the additional information. However, I still have the concern that this PR will introduce a regression and we will face #23959 again. I'm not saying we do not have to address #26397, but I'm not sure if this PR is the proper fix.
I gave this PR a try. Let me share my findings (click on the collapsible scenarios to see the details):
Scenario 1: Store artifacts in external S3-compatible storage (MinIO) ✅
To take the minio subchart out of the equation, I'm deactivating it and explicitly installing it with hardcoded credentials.
- Install MinIO
values-minio.yaml customizations:
auth:
rootPassword: somepassword
defaultBuckets: test-mlflow
provisioning:
enabled: true
extraCommands:
- mc anonymous set download provisioning/test-mlflow
$ helm install minio oci://registry-1.docker.io/bitnamicharts/minio --version 14.6.16 --values /tmp/values-minio.yaml
- Install mlflow (with the changes from this PR)
values-mlflow-minio.yaml customizations:
run:
source:
configMap:
# Example taken from the MLFlow UI
test.py: |
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
import sklearn.ensemble
# set the experiment id
mlflow.set_experiment(experiment_id="0")
mlflow.autolog()
db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
# Create and train models.
rf = sklearn.ensemble.RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)
# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)
minio:
enabled: false
externalS3:
host: minio.default.svc.cluster.local
port: 9000
protocol: http
bucket: test-mlflow
accessKeyID: admin
accessKeySecret: somepassword
$ helm install mlflow-minio ./bitnami/mlflow --values /tmp/values-mlflow-minio.yaml
- Create experiment run
$ kubectl exec -it deployment/mlflow-minio-run -- python test.py
2024/07/01 14:19:27 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '7fe8ec68831b4eeea1680fe52753d45a', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
- Check artifacts in MinIO ✅
$ kubectl exec -it deployment/minio -- mc ls local/test-mlflow/0
[2024-07-01 14:25:36 UTC] 0B 7fe8ec68831b4eeea1680fe52753d45a/
- Check artifacts via UI ✅
$ kubectl get secret --namespace default mlflow-minio-tracking -o jsonpath="{.data.admin-password}" | base64 -d
$ kubectl port-forward svc/mlflow-minio-tracking 8080:80
http://localhost:8080/#/experiments/0/runs/7fe8ec68831b4eeea1680fe52753d45a/artifacts
- Check mlflow logs ✅
$ kubectl logs deploy/mlflow-minio-tracking
Scenario 2: Store artifacts in AWS S3 ❌
- Create an AWS S3 bucket
test-mlflow-ab891a3
- Install mlflow (with the changes from this PR)
values-mlflow-s3.yaml customizations:
run:
source:
configMap:
# Example taken from the MLFlow UI
test.py: |
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
import sklearn.ensemble
# set the experiment id
mlflow.set_experiment(experiment_id="0")
mlflow.autolog()
db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
# Create and train models.
rf = sklearn.ensemble.RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)
# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)
minio:
enabled: false
externalS3:
host: test-mlflow-ab891a3.s3.amazonaws.com
bucket: test-mlflow-ab891a3
useCredentialsInSecret: false
tracking:
extraEnvVars:
- name: AWS_ACCESS_KEY_ID
value: ASIA***
- name: AWS_SECRET_ACCESS_KEY
value: '***'
- name: AWS_SESSION_TOKEN
value: '***'
$ helm install mlflow-s3 ./bitnami/mlflow --values /tmp/values-mlflow-s3.yaml
- Create experiment run
$ kubectl exec -it deployment/mlflow-s3-run -- python test.py
2024/07/01 15:12:56 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f9e2c06af11b4ff787ee400e5912d75b', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
- Check artifacts in S3 ⚠️
$ aws s3 ls s3://test-mlflow-ab891a3/test-mlflow-ab891a3/0/
PRE f9e2c06af11b4ff787ee400e5912d75b/
They exist but the objects are uploaded in a root directory named like the bucket.
- Check artifacts via UI ❌
$ kubectl get secret --namespace default mlflow-s3-tracking -o jsonpath="{.data.admin-password}" | base64 -d
$ kubectl port-forward svc/mlflow-s3-tracking 8080:80
http://localhost:8080/#/experiments/0/runs/f9e2c06af11b4ff787ee400e5912d75b/artifacts
Artifacts cannot be retrieved.
- Check mlflow logs ❌
$ kubectl logs deploy/mlflow-s3-tracking
2024/07/01 15:17:11 ERROR mlflow.server: Exception on /ajax-api/2.0/mlflow/artifacts/list [GET]
Traceback (most recent call last):
File "/opt/bitnami/python/lib/python3.10/site-packages/flask/app.py", line 1473, in wsgi_app
response = self.full_dispatch_request()
...
File "/opt/bitnami/python/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.
I guess the error in "Scenario 2" is raised because the objects are not written in the correct path (under a root directory named like the bucket, I believe). The documentation warns about this: https://mlflow.org/docs/2.14.1/tracking/artifacts-stores.html#setting-bucket-region.
With bitnami/mlflow:1.4.12, "Scenario 2" is working as expected and "Scenario 1" would just need to explicitly set the MLFLOW_S3_ENDPOINT_URL env-var:
tracking:
extraEnvVars:
- name: MLFLOW_S3_ENDPOINT_URL
value: 'http://minio.default.svc.cluster.local:9000'
Let me know your thoughts on this @dhrp, @frittentheke.
I gave this PR a try. Let me share my findings (click on the collapsible scenarios to see the details): Scenario 1: Store artifacts in external S3-compatible storage (MinIO) ✅ Scenario 2: Store artifacts in AWS S3 ❌
I guess the error in "Scenario 2" is raised because the objects are not written in the correct path (the root directory named like the bucket I believe). The documentation warns about this: https://mlflow.org/docs/2.14.1/tracking/artifacts-stores.html#setting-bucket-region.
With bitnami/mlflow:1.4.12, "Scenario 2" is working as expected and "Scenario 1" would just need to explicitly set the MLFLOW_S3_ENDPOINT_URL env var
First of all thank you very much for the time and intensive testing @andresbono! Sorry it took me a while to respond.
The issue here is simply the two addressing styles for S3 buckets: virtual-hosted-style and path-style.
This is the MLflow code (boto3 AWS S3 client) that is used here: https://github.com/mlflow/mlflow/blob/0ebbefc8c075165b558dfb481cf1e39d5cb53003/mlflow/store/artifact/s3_artifact_repo.py#L72C50-L72C89
The addressing_style parameter is the important bit - see https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/s3.html#changing-the-addressing-style
MLflow actually also allows setting MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE via https://github.com/mlflow/mlflow/blob/0ebbefc8c075165b558dfb481cf1e39d5cb53003/mlflow/environment_variables.py#L605, and defaults to auto.
And auto in the case of AWS means that a hostname with the bucket name in front already addresses the individual bucket. So the first path segment is NOT the bucket (as with path-style), but a first "subfolder", which is exactly what you saw.
If you change the endpoint to a path-style pattern with the bucket name at the end (and NOT as hostname) it should work.
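To make the two styles concrete, here is a small illustrative Python sketch (not MLflow or boto3 code; the endpoint and bucket names are examples) showing where each style places the bucket, and why a virtual-hosted endpoint combined with path-style addressing produces the doubled bucket prefix seen in Scenario 2:

```python
def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    # Path-style: the bucket is the first path segment after the endpoint.
    return f"https://{endpoint}/{bucket}/{key}"

def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    # Virtual-hosted-style: the bucket is part of the hostname.
    return f"https://{bucket}.{endpoint}/{key}"

bucket = "test-mlflow-ab891a3"
key = "0/run-id/artifact"

print(path_style_url("s3.amazonaws.com", bucket, key))
# https://s3.amazonaws.com/test-mlflow-ab891a3/0/run-id/artifact
print(virtual_hosted_url("s3.amazonaws.com", bucket, key))
# https://test-mlflow-ab891a3.s3.amazonaws.com/0/run-id/artifact

# If the configured endpoint already contains the bucket (a virtual-hosted
# host) and the client then also adds the bucket as a path segment, objects
# land under bucket/bucket/..., the doubled prefix observed in Scenario 2:
print(path_style_url(f"{bucket}.s3.amazonaws.com", bucket, key))
# https://test-mlflow-ab891a3.s3.amazonaws.com/test-mlflow-ab891a3/0/run-id/artifact
```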
Edit: Maybe it makes sense to also expose the MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE (https://github.com/mlflow/mlflow/blob/0ebbefc8c075165b558dfb481cf1e39d5cb53003/mlflow/environment_variables.py#L605) variable via the Helm chart / values.yaml, to make the externalS3 configuration as flexible as MLflow itself? @dhrp ?
@andresbono @javsalgar Could you kindly check if my explanation (https://github.com/bitnami/charts/pull/26462#issuecomment-2209054770) holds true and this PR could move forward?
That did the trick, @frittentheke. Thank you so much for looking into it.
For external AWS S3, externalS3.host=s3.amazonaws.com works flawlessly; there is no need to include the bucket name in the host (virtual-hosted-style), as that leads to a malformed endpoint URI.
Edit: Maybe it makes sense to also expose the MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE (https://github.com/mlflow/mlflow/blob/0ebbefc8c075165b558dfb481cf1e39d5cb53003/mlflow/environment_variables.py#L605) variable via the Helm chart / values.yaml, to make the externalS3 configuration as flexible as MLflow itself? @dhrp ?
A user could pass the MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE via the tracking.extraEnvVars value if needed.
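For reference, a sketch of that workaround as values-file customizations (per the boto3 documentation the accepted addressing styles are "auto", "virtual", and "path"; "path" is shown here as an example):

```yaml
tracking:
  extraEnvVars:
    - name: MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE
      value: "path"
```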
@andresbono should this auto-merge or are the CI failures expected?
No, the CI failures are not expected. Let me look into it.
Finally merged!! Thank you so much for being collaborative and digging into our suggestions and questions!!