
[Bug]: fail to start milvus with GCP as externalS3

Open punkerpunker opened this issue 1 year ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: 2.2.9
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): -
- OS(Ubuntu or CentOS): -
- CPU/Memory: -
- GPU: -
- Others: -

Current Behavior

Hello, the Milvus Query Node fails to authenticate with externalS3 when cloudProvider: "gcp" is set, and is therefore unable to start. The rest of the components started properly, with no warnings or errors in the logs. I've looked through the Milvus codebase, but once I realised it is a CGo issue, I decided to report it here.

Milvus is deployed on a Kubernetes cluster. My Kubernetes pod and service subnets are 10.236.64.0/18 and 10.236.0.0/18.

I am deploying Milvus using the Helm chart; here is my values.yaml:

milvus:
  cluster:
    enabled: true
    
  metrics:  
    enabled: true
    serviceMonitor:
      enabled: true
      interval: "30s"
      scrapeTimeout: "10s"
      additionalLabels:
        release: kube-prometheus-stack

  queryNode:
    replicas: 3
    extraEnv:
    - name: HTTPS_PROXY
      value: http://proxy:3128
    - name: HTTP_PROXY
      value: http://proxy:3128
    - name: NO_PROXY
      value: milvus-etcd,10.236.64.0/18,10.236.0.0/18,.svc.cluster.local,localhost,127.0.0.1,kubernetes.default.svc

  indexNode:
    extraEnv:
    - name: HTTPS_PROXY
      value: http://proxy:3128
    - name: NO_PROXY
      value: milvus-etcd,10.236.64.0/18,10.236.0.0/18,.svc.cluster.local,localhost,127.0.0.1,kubernetes.default.svc
    replicas: 3

  dataNode:
    extraEnv:
    - name: HTTPS_PROXY
      value: http://proxy:3128
    - name: NO_PROXY
      value: milvus-etcd,10.236.64.0/18,10.236.0.0/18,.svc.cluster.local,localhost,127.0.0.1,kubernetes.default.svc
    replicas: 3
  
  minio:
    enabled: false
  
  indexCoordinator:
    extraEnv:
    - name: HTTPS_PROXY
      value: http://proxy:3128
    - name: NO_PROXY
      value: milvus-etcd,10.236.64.0/18,10.236.0.0/18,.svc.cluster.local,localhost,127.0.0.1,kubernetes.default.svc

  dataCoordinator:
    extraEnv: 
    - name: HTTPS_PROXY
      value: http://proxy:3128
    - name: NO_PROXY
      value: milvus-etcd,10.236.64.0/18,10.236.0.0/18,.svc.cluster.local,localhost,127.0.0.1,kubernetes.default.svc

  externalS3:
    enabled: true
    bucketName: milvus-dev
    host: storage.googleapis.com
    port: 443
    cloudProvider: "gcp"
    useSSL: true
    useIAM: false
    rootPath: "milvus"
    accessKey: <key>
    secretKey: <key>

Expected Behavior

The Query Node should be able to authenticate with GCS and start.

Steps To Reproduce

No response

Milvus Log

2023/06/07 18:22:59 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined

    __  _________ _   ____  ______
   /  |/  /  _/ /| | / / / / / __/
  / /|_/ // // /_| |/ / /_/ /\ \
 /_/  /_/___/____/___/\____/___/

Welcome to use Milvus!
Version:   v2.2.9
Built:     Fri Jun  2 09:38:35 UTC 2023
GitCommit: 9ffcd53b
GoVersion: go version go1.18.3 linux/amd64

open pid file: /run/milvus/querynode.pid
lock pid file: /run/milvus/querynode.pid
[2023/06/07 18:22:59.107 +00:00] [INFO] [roles/roles.go:226] ["starting running Milvus components"]
[2023/06/07 18:22:59.107 +00:00] [INFO] [roles/roles.go:152] ["Enable Jemalloc"] ["Jemalloc Path"=/milvus/lib/libjemalloc.so]
[2023/06/07 18:22:59.107 +00:00] [INFO] [management/server.go:68] ["management listen"] [addr=:9091]
[2023/06/07 18:22:59.120 +00:00] [INFO] [config/etcd_source.go:145] ["start refreshing configurations"]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/quota_param.go:745] ["init disk quota"] [diskQuota(MB)=+inf]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/quota_param.go:760] ["init disk quota per DB"] [diskQuotaPerCollection(MB)=1.7976931348623157e+308]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/component_param.go:1543] ["init segment max idle time"] [value=10m0s]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/component_param.go:1548] ["init segment min size from idle to sealed"] [value=16]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/component_param.go:1558] ["init segment max binlog file to sealed"] [value=32]
[2023/06/07 18:22:59.121 +00:00] [INFO] [paramtable/component_param.go:1553] ["init segment expansion rate"] [value=1.25]
[2023/06/07 18:22:59.122 +00:00] [INFO] [paramtable/base_table.go:142] ["cannot find etcd.endpoints"]
[2023/06/07 18:22:59.122 +00:00] [INFO] [paramtable/hook_config.go:19] ["hook config"] [hook={}]
[2023/06/07 18:22:59.122 +00:00] [ERROR] [querynode/query_node.go:188] ["load queryhook failed"] [error="fail to set the querynode plugin path"] [stack="github.com/milvus-io/milvus/internal/querynode.NewQueryNode\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/query_node.go:188\ngithub.com/milvus-io/milvus/internal/distributed/querynode.NewServer\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:83\ngithub.com/milvus-io/milvus/cmd/components.NewQueryNode\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_node.go:40\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:110"]
[2023/06/07 18:22:59.132 +00:00] [INFO] [config/etcd_source.go:145] ["start refreshing configurations"]
[2023/06/07 18:22:59.133 +00:00] [DEBUG] [paramtable/grpc_param.go:153] [initServerMaxSendSize] [role=querynode] [grpc.serverMaxSendSize=536870912]
[2023/06/07 18:22:59.133 +00:00] [DEBUG] [paramtable/grpc_param.go:175] [initServerMaxRecvSize] [role=querynode] [grpc.serverMaxRecvSize=536870912]
[2023/06/07 18:22:59.133 +00:00] [INFO] [querynode/service.go:106] [QueryNode] [port=21123]
[2023/06/07 18:22:59.134 +00:00] [INFO] [querynode/service.go:122] ["QueryNode connect to etcd successfully"]
[2023/06/07 18:22:59.234 +00:00] [INFO] [querynode/service.go:132] [QueryNode] [State=Initializing]
[2023/06/07 18:22:59.234 +00:00] [INFO] [querynode/query_node.go:299] ["QueryNode session info"] [metaPath=by-dev/meta]
[2023/06/07 18:22:59.234 +00:00] [INFO] [sessionutil/session_util.go:202] ["Session try to connect to etcd"]
[2023/06/07 18:22:59.235 +00:00] [INFO] [sessionutil/session_util.go:217] ["Session connect to etcd success"]
[2023/06/07 18:22:59.243 +00:00] [INFO] [sessionutil/session_util.go:300] ["Session get serverID success"] [key=id] [ServerId=594]
[2023/06/07 18:22:59.253 +00:00] [INFO] [config/etcd_source.go:145] ["start refreshing configurations"]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/quota_param.go:745] ["init disk quota"] [diskQuota(MB)=+inf]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/quota_param.go:760] ["init disk quota per DB"] [diskQuotaPerCollection(MB)=1.7976931348623157e+308]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/component_param.go:1543] ["init segment max idle time"] [value=10m0s]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/component_param.go:1548] ["init segment min size from idle to sealed"] [value=16]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/component_param.go:1558] ["init segment max binlog file to sealed"] [value=32]
[2023/06/07 18:22:59.253 +00:00] [INFO] [paramtable/component_param.go:1553] ["init segment expansion rate"] [value=1.25]
[2023/06/07 18:22:59.254 +00:00] [INFO] [paramtable/base_table.go:142] ["cannot find etcd.endpoints"]
[2023/06/07 18:22:59.254 +00:00] [INFO] [paramtable/hook_config.go:19] ["hook config"] [hook={}]
[2023/06/07 18:22:59.255 +00:00] [INFO] [logutil/logutil.go:165] ["Log directory"] [configDir=]
[2023/06/07 18:22:59.255 +00:00] [INFO] [logutil/logutil.go:166] ["Set log file to "] [path=]
[2023/06/07 18:22:59.255 +00:00] [INFO] [querynode/query_node.go:209] ["QueryNode init session"] [nodeID=594] ["node address"=10.236.72.81:21123]
[2023/06/07 18:22:59.255 +00:00] [INFO] [querynode/query_node.go:315] ["QueryNode init rateCollector done"] [nodeID=594]
[2023/06/07 18:22:59.695 +00:00] [INFO] [storage/minio_chunk_manager.go:145] ["minio chunk manager init success."] [bucketname=milvus-dev] [root=milvus]
[2023/06/07 18:22:59.695 +00:00] [INFO] [querynode/query_node.go:325] ["queryNode try to connect etcd success"] [MetaRootPath=by-dev/meta]
[2023/06/07 18:22:59.695 +00:00] [INFO] [querynode/segment_loader.go:945] ["SegmentLoader created"] [ioPoolSize=256] [cpuPoolSize=32]
2023-06-07 18:22:59,696 INFO [default] [KNOWHERE][SetBlasThreshold][milvus] Set faiss::distance_compute_blas_threshold to 16384
2023-06-07 18:22:59,696 INFO [default] [KNOWHERE][SetEarlyStopThreshold][milvus] Set faiss::early_stop_threshold to 0
2023-06-07 18:22:59,696 INFO [default] [KNOWHERE][SetStatisticsLevel][milvus] Set knowhere::STATISTICS_LEVEL to 0
2023-06-07 18:22:59,696 | DEBUG | default | [SERVER][operator()][milvus] Config easylogging with yaml file: /milvus/configs/easylogging.yaml
2023-06-07 18:22:59,697 | DEBUG | default | [SEGCORE][SegcoreSetSimdType][milvus] set config simd_type: auto
2023-06-07 18:22:59,697 | INFO | default | [KNOWHERE][SetSimdType][milvus] FAISS expect simdType::AUTO
2023-06-07 18:22:59,697 | INFO | default | [KNOWHERE][SetSimdType][milvus] FAISS hook AVX2
2023-06-07 18:22:59,697 | DEBUG | default | [SEGCORE][SetIndexSliceSize][milvus] set config index slice size(byte): 16777216
2023-06-07 18:22:59,697 | DEBUG | default | [SEGCORE][SetThreadCoreCoefficient][milvus] set thread pool core coefficient: 10
[2023/06/07 18:22:59.719 +00:00] [WARN] [initcore/init_storage_config.go:94] ["InitRemoteChunkManagerSingleton failed, C Runtime Exception: [UnexpectedError] get authorization failed, errcode:UNAVAILABLE\n"]
[2023/06/07 18:22:59.719 +00:00] [ERROR] [querynode/query_node.go:348] ["QueryNode init segcore failed"] [error="[UnexpectedError] get authorization failed, errcode:UNAVAILABLE"] [stack="github.com/milvus-io/milvus/internal/querynode.(*QueryNode).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/query_node.go:348\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Init\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/query_node.go:297\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:133\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:213\ngithub.com/milvus-io/milvus/cmd/components.(*QueryNode).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_node.go:54\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:120"]
[2023/06/07 18:22:59.719 +00:00] [ERROR] [querynode/service.go:134] ["QueryNode init error: "] [error="[UnexpectedError] get authorization failed, errcode:UNAVAILABLE"] [stack="github.com/milvus-io/milvus/internal/distributed/querynode.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:134\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:213\ngithub.com/milvus-io/milvus/cmd/components.(*QueryNode).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_node.go:54\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:120"]
panic: [UnexpectedError] get authorization failed, errcode:UNAVAILABLE

goroutine 194 [running]:
github.com/milvus-io/milvus/cmd/components.(*QueryNode).Run(0x5ba3400?)
	/go/src/github.com/milvus-io/milvus/cmd/components/query_node.go:55 +0x56
github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1()
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:120 +0x182
created by github.com/milvus-io/milvus/cmd/roles.runComponent[...]
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:104 +0x18a

Anything else?

No response

punkerpunker avatar Jun 07 '23 18:06 punkerpunker

/assign @locustbaby /unassign

yanliang567 avatar Jun 08 '23 00:06 yanliang567

Hi @locustbaby, I don't see this error in the log, though it seems to be the one actually causing the issue:

https://github.com/milvus-io/milvus/blob/9ffcd53bd41af26c34d4308b7d48a64d19acc118/internal/core/src/storage/MinioChunkManager.cpp#L125-L132.

To clarify: does Milvus support using GCS as the externalS3 when not running on GKE or GCE (and therefore without IAM)?

punkerpunker avatar Jun 08 '23 12:06 punkerpunker

@punkerpunker It doesn't support using GCS without IAM.

jaime0815 avatar Jun 12 '23 04:06 jaime0815

Feel free to contribute if anyone has this requirement.

xiaofan-luan avatar Jun 12 '23 05:06 xiaofan-luan

I'm also facing issues getting GCS to work as externalS3. Some components cannot work with IAM enabled while other components cannot work with IAM disabled.

For example, when I set useIAM: true, dataNode fails with Access denied:

[WARN] [storage/minio_chunk_manager.go:203] ["failed to put object"] [path=insert_log/442152882448630211/442152882448630212/442152882448830296/0/442152882448830306] [error="Access denied."]
...
[WARN] [datanode/flush_task.go:230] ["flush task error detected"] [error="All attempts results:\nattempt #1:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #2:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #3:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #4:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #5:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #6:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #7:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #8:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #9:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #10:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\n"] []
[ERROR] [datanode/flush_manager.go:759] ["flush pack with error, DataNode quit now"] [error="execution failed"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:759\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204"]
panic: execution failed

Then when I set useIAM: false, dataNode is able to flush the segment. However, the following components fail:

  • queryNode
[ERROR] [querynode/service.go:134] ["QueryNode init error: "] [error="[UnexpectedError] google cloud only support iam mode now"]
  • indexNode
[ERROR] [indexnode/task.go:340] ["failed to build index"] [error="[UnexpectedError] google cloud only support iam mode now"]

In order to overcome these issues, I had to configure querynode, indexnode, and indexcoord with useIAM: true while setting useIAM: false globally for the other components. I did this by copying and overriding the Milvus config ConfigMap and attaching the copy to the affected nodes' deployments. I'm not sure whether we can override configs through extraEnv for each component instead.
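
If Milvus supported such a mapping (I have not verified that it does), a minimal sketch of the extraEnv idea might look like this; MINIO_USEIAM is an assumed environment-variable name for the minio.useIAM config key, not a confirmed one:

# Hypothetical per-component override; MINIO_USEIAM is an assumed mapping
# for minio.useIAM and has not been verified against the Milvus codebase.
queryNode:
  extraEnv:
  - name: MINIO_USEIAM
    value: "true"
indexNode:
  extraEnv:
  - name: MINIO_USEIAM
    value: "true"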

Here are the related helm values configs that I've used:

minio:
  enabled: false

externalS3:
  enabled: true # Enable or disable external S3 false
  host: "storage.googleapis.com" # The host of the external S3 unset
  port: 443 # The port of the external S3 unset
  accessKey: "***" # The Access Key of the external S3 unset
  secretKey: "***" # The Secret Key of the external S3 unset
  bucketName: "bucket-name" # The Bucket Name of the external S3  unset
  useSSL: true # If true, use SSL to connect to the external S3  false
  useIAM: false # If true, use iam to connect to the external S3  false
  cloudProvider: "gcp"

dataNode:
  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

dataCoordinator:
  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

indexNode:
  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

indexCoord:
  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

queryNode:
  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

Secret to be created using the generated key of the IAM service account:

kubectl create secret generic minio-gcs-secret --from-file=gcs_key.json=minio-gcs-key.json

I'm using these commands to patch deployments:

# Clone the release ConfigMap under a new name and flip useIAM to true in the copy
kubectl patch cm RELEASE_NAME-milvus -p '{"metadata":{ "name":"RELEASE_NAME-milvus-iam"}}' --dry-run=client -o yaml -n NAMESPACE | sed 's/useIAM: false/useIAM: true/g' | kubectl apply -f -

# Re-create the querynode deployment so it mounts the cloned ConfigMap
kubectl get deployment RELEASE_NAME-milvus-querynode -o yaml -n NAMESPACE | sed 's/name: RELEASE_NAME-milvus$/name: RELEASE_NAME-milvus-iam/g' > RELEASE_NAME-milvus-querynode-deployment.yaml
kubectl delete deployment RELEASE_NAME-milvus-querynode -n NAMESPACE
kubectl apply -f RELEASE_NAME-milvus-querynode-deployment.yaml -n NAMESPACE

# Same for the indexnode deployment
kubectl get deployment RELEASE_NAME-milvus-indexnode -o yaml -n NAMESPACE | sed 's/name: RELEASE_NAME-milvus$/name: RELEASE_NAME-milvus-iam/g' > RELEASE_NAME-milvus-indexnode-deployment.yaml
kubectl delete deployment RELEASE_NAME-milvus-indexnode -n NAMESPACE
kubectl apply -f RELEASE_NAME-milvus-indexnode-deployment.yaml -n NAMESPACE

# Same for the indexcoord deployment
kubectl get deployment RELEASE_NAME-milvus-indexcoord -o yaml -n NAMESPACE | sed 's/name: RELEASE_NAME-milvus$/name: RELEASE_NAME-milvus-iam/g' > RELEASE_NAME-milvus-indexcoord-deployment.yaml
kubectl delete deployment RELEASE_NAME-milvus-indexcoord -n NAMESPACE
kubectl apply -f RELEASE_NAME-milvus-indexcoord-deployment.yaml -n NAMESPACE

All the above issues are now resolved. However, indexNode is still failing to upload the index:

[INFO] [indexnode/task.go:346] ["Successfully build index"] [buildID=442219743964239016] [Collection=442219743964038755] [SegmentID=442219743964238988]
terminate called after throwing an instance of 'milvus::storage::S3ErrorException'  what():  Error:PutObjectBuffer:AccessDenied  Access denied.
SIGABRT: abort

ahmed-mahran avatar Jun 16 '23 20:06 ahmed-mahran

It seems that there is still an authentication issue:

Error:PutObjectBuffer:AccessDenied

@zwd1208 any recommendations?

xiaofan-luan avatar Jun 17 '23 07:06 xiaofan-luan

Hi @ahmed-mahran, sorry, I'm a bit confused; I guess the second useIAM should be false?

And may I know which auth method you are using now, IAM or AK/SK?

As our engineer said, Milvus only supports GCS with IAM for now; that's the reason querynode and indexnode throw the error google cloud only support iam mode now.

locustbaby avatar Jun 19 '23 09:06 locustbaby

Hi @ahmed-mahran, sorry, I'm a bit confused; I guess the second useIAM should be false?

You are right. I've edited my comment.

And may I know which auth method you are using now, IAM or AK/SK?

I'm using a hybrid mode:

  • useIAM: true for querynode, indexnode and indexcoord
  • useIAM: false for the rest

ahmed-mahran avatar Jun 19 '23 09:06 ahmed-mahran

Milvus supports GCS with IAM, so you can set useIAM: true globally. Since you said there was an Access denied error when you set useIAM: true globally, can you check your IAM configuration?

locustbaby avatar Jun 19 '23 14:06 locustbaby

I'm setting the GOOGLE_APPLICATION_CREDENTIALS environment variable:

  extraEnv:
  - name: "GOOGLE_APPLICATION_CREDENTIALS"
    valueFrom:
      secretKeyRef:
        name: minio-gcs-secret
        key: gcs_key.json

The key is generated for a service account with admin privileges.

ahmed-mahran avatar Jun 19 '23 15:06 ahmed-mahran

@ahmed-mahran Have you tried restarting the cluster? Are you still stuck? Can you try standalone mode with GCS IAM?

locustbaby avatar Jun 26 '23 09:06 locustbaby

I've tried standalone mode with GCS IAM and I'm getting the same errors:

[WARN] [storage/minio_chunk_manager.go:203] ["failed to put object"] [path=insert_log/442490109151675611/442490109151675612/442490109151875757/0/442490186934517781] [error="Access denied."]
[WARN] [datanode/flush_task.go:230] ["flush task error detected"] [error="All attempts results:\nattempt #1:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #2:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #3:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #4:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #5:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #6:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #7:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #8:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #9:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\nattempt #10:All attempts results:\nattempt #1:Access denied.\nattempt #2:Access denied.\nattempt #3:Access denied.\nattempt #4:Access denied.\nattempt #5:Access denied.\n\n"] []
[ERROR] [datanode/flush_manager.go:759] ["flush pack with error, DataNode quit now"] [error="execution failed"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:759\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204"]
panic: execution failed

Please note that with the hybrid authentication mode (https://github.com/milvus-io/milvus/issues/24727#issuecomment-1596850829), which I tested in cluster mode, dataNode was able to write the segment to GCS; however, indexNode is failing to put the index:

[INFO] [indexnode/task.go:346] ["Successfully build index"] [buildID=442219743964239016] [Collection=442219743964038755] [SegmentID=442219743964238988]
terminate called after throwing an instance of 'milvus::storage::S3ErrorException'  what():  Error:PutObjectBuffer:AccessDenied  Access denied.
SIGABRT: abort

ahmed-mahran avatar Jun 28 '23 15:06 ahmed-mahran

/assign @haorenfsa could you please help dudes on the gcp access problem

xiaofan-luan avatar Jun 29 '23 06:06 xiaofan-luan

Hi @ahmed-mahran, the module in Milvus that manages object storage is called the chunk manager. There are two types of chunk managers in Milvus: the Golang chunk manager in the Go code, and the Cpp chunk manager in the C++ code.

The Golang chunk manager supports GCS well, with or without useIAM; however, the Cpp chunk manager currently only supports useIAM.

In previous versions, only the diskANN index used the Cpp chunk manager, so everything worked well. Now we're switching to using the Cpp chunk manager only, and that's why things go wrong.

I also see that you're trying to use GOOGLE_APPLICATION_CREDENTIALS. However, our C++ code uses the AWS SDK, so that is not supported.

Here are a few solutions you can choose from for now:

  1. Use a MinIO GCS gateway to proxy all requests; the gateway supports GOOGLE_APPLICATION_CREDENTIALS. The detailed steps are described in our former docs: https://milvus.io/docs/v2.1.x/gcp.md
  2. Use IAM access for all Milvus components. You'll need to create a GCP service account, grant it access, and add annotations to the Kubernetes service account (see GCP's Workload Identity doc: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). Finally, make sure the Milvus pods use that service account. A command sketch follows after this list.
  3. Use an old version of Milvus that only uses the Golang chunk manager, v2.2.6 or earlier (if you're not going to use diskANN).
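
For option 2, a rough command sketch; PROJECT_ID, NAMESPACE, KSA_NAME and GSA_NAME are placeholders, and you may want a narrower role than objectAdmin:

# Create a Google service account and grant it access to the bucket
gcloud iam service-accounts create GSA_NAME --project=PROJECT_ID
gsutil iam ch serviceAccount:GSA_NAME@PROJECT_ID.iam.gserviceaccount.com:roles/storage.objectAdmin gs://milvus-dev

# Allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

# Annotate the Kubernetes service account used by the Milvus pods
kubectl annotate serviceaccount KSA_NAME -n NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com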

We'll fix it soon and make the full functionality available in the next release.

haorenfsa avatar Jun 29 '23 07:06 haorenfsa

Thanks for the detailed answer, @haorenfsa

  1. Use a MinIO GCS gateway to proxy all requests; the gateway supports GOOGLE_APPLICATION_CREDENTIALS. The detailed steps are described in our former docs: https://milvus.io/docs/v2.1.x/gcp.md

This was the first thing I tried. However, MinIO crashes because the GCS gateway feature has been deprecated and removed (https://blog.min.io/deprecation-of-the-minio-gateway/). I guess I would need to find an older, compatible version of MinIO.

  2. Use IAM access for all Milvus components. You'll need to create a GCP service account, grant it access, and add annotations to the Kubernetes service account (see GCP's Workload Identity doc: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). Finally, make sure the Milvus pods use that service account.

I've also tried Workload Identity, but unfortunately it didn't work. I verified that my setup was OK following https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#verify_the_setup. I'm not able to provide many details on this as I don't have the logs, but from my search history I can tell that I was getting Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist). Both the Kubernetes and Google service accounts were given full admin privileges.

  3. Use an old version of Milvus that only uses the Golang chunk manager, v2.2.6 or earlier (if you're not going to use diskANN).

I'm not sure whether multi-tenancy and RBAC through databases were supported back then.

We'll fix it soon and make the full functionality available in the next release.

That's good news! I think I'll wait until the next release, use a nightly version, or apply the patch and build my own version.

ahmed-mahran avatar Jun 29 '23 20:06 ahmed-mahran

@ahmed-mahran Thank you for your patience 😂. About solution 2: it should work; our service on GCP also uses this method. You can check that your configuration is correct by running kubectl exec <pod> -- bash to get into the Milvus pod and executing the following commands:

# acquire identity token from gcp meta server
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

Find the token field in the output, copy it, and execute:

export token=<token>
export bucket=<my-bucket>
# check if you can now list objects in a bucket
curl "https://storage.googleapis.com/$bucket?list-type=2&prefix=" -H "Authorization: Bearer $token"

If all your configuration is correct, the commands above should work. If not, you can diagnose the problem from the hints in the output.

From my experience, it's likely that one of the setup steps went wrong. I can help you check if you'd like to share your NAMESPACE & KSA_NAME.
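
The binding can also be checked from outside the pod; a quick sketch with placeholder names (NAMESPACE, KSA_NAME, GSA_NAME, PROJECT_ID):

# Does the Kubernetes service account carry the Workload Identity annotation?
kubectl get serviceaccount KSA_NAME -n NAMESPACE \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

# Does the Google service account allow that KSA to impersonate it?
gcloud iam service-accounts get-iam-policy GSA_NAME@PROJECT_ID.iam.gserviceaccount.com --format=json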

And I've tested the fix patch on GCP; it will be merged soon.

haorenfsa avatar Jul 03 '23 03:07 haorenfsa