
Ruler unable to list rules when s3 bucket uses percentage encoding

Open PatrikMuniak opened this issue 2 years ago • 7 comments

Describe the bug We are setting up Cortex on premises with an S3-compatible object store, Hitachi Content Platform (HCP). The Cortex ruler fails to read rules from the HCP bucket. When Cortex lists the rule groups, the object keys come back with the = padding percent-encoded as %3D (for example, the object bG9raS1ub2Rlcy1ydWxlcw== in the bucket is returned as bG9raS1ub2Rlcy1ydWxlcw%3D%3D), so base64 decoding of the key fails while listing rule groups.

https://github.com/cortexproject/cortex/blob/347aacd2c836d5842db8ec972b40a26345b41d82/pkg/ruler/rulestore/bucketclient/bucket_client.go#L300

To reproduce the decoding failure outside Cortex, I wrote this test:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// Key as returned by the bucket listing: the "==" padding arrives as %3D%3D.
	decodedNamespace, err := base64.URLEncoding.DecodeString("bG9raS1ub2Rlcy1ydWxlcw%3D%3D")
	// Round trip of the same namespace without percent-encoding, for comparison.
	encoded := base64.URLEncoding.EncodeToString([]byte("loki-nodes-rules"))
	decoded, err2 := base64.URLEncoding.DecodeString(encoded)
	fmt.Println(string(decodedNamespace), err)
	fmt.Println(string(decoded), err2)
}

loki-nodes-rule illegal base64 data at input byte 22
loki-nodes-rules <nil>
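
As an illustration only, the test above can be extended to show that undoing the percent-encoding before base64 decoding recovers the namespace. This is a minimal sketch and not the code Cortex uses: decodeNamespace is a hypothetical helper introduced here, and the choice of url.PathUnescape is an assumption about how one might normalise the key.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

// decodeNamespace is a hypothetical helper, not part of Cortex. It undoes
// percent-encoding (for example %3D back to =) and then base64-decodes
// the key segment with the URL-safe alphabet.
func decodeNamespace(segment string) (string, error) {
	unescaped, err := url.PathUnescape(segment)
	if err != nil {
		return "", fmt.Errorf("percent-decoding %q: %w", segment, err)
	}
	decoded, err := base64.URLEncoding.DecodeString(unescaped)
	if err != nil {
		return "", fmt.Errorf("base64-decoding %q: %w", unescaped, err)
	}
	return string(decoded), nil
}

func main() {
	segments := []string{
		"bG9raS1ub2Rlcy1ydWxlcw%3D%3D", // as returned by the HCP bucket listing
		"bG9raS1ub2Rlcy1ydWxlcw==",     // as stored in the bucket
	}
	for _, segment := range segments {
		namespace, err := decodeNamespace(segment)
		fmt.Println(namespace, err)
	}
}
```

The URL-safe base64 alphabet contains no % character, so unescaping a key that was never percent-encoded leaves it unchanged.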

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex 1.11.1 with single-process-config-blocks.yaml
  2. Configure an HCP bucket as the ruler storage
  3. Upload a sample rule: ./cortextool rules load ~/notes/paas/cortex-rules-alerts/ruler/loki-nodes-rules.yaml --address=http://<url>:9008 --id=nap-tom
  4. Check the logs for errors coming from bucket_client.go (see below for the log I received)

config.yaml

# Configuration for running Cortex in single-process mode.
# This should not be used in production.  It is only for getting started
# and development.

# Disable the requirement that every request to Cortex has a
# X-Scope-OrgID header. `fake` will be substituted in instead.
auth_enabled: false

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug
  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    # The address to advertise for this ingester.  Will be autodiscovered by
    # looking up address on eth0 or en0; can be specified if this fails.
    # address: 127.0.0.1
    interface_names: [ens160] 
    # We want to start immediately and flush on shutdown.
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  # You can choose between local storage and Amazon S3, Google GCS and Azure storage. Each option requires additional configuration
  # as shown below. All options can be configured via flags as well which might be handy for secret inputs.
  backend: s3 # s3, gcs, azure or filesystem are valid options
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: url
    access_key_id: "user"
    secret_access_key: "password"
    #insecure: true
    #signature_version: "v2"
    http:
      insecure_skip_verify: true


compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

ruler_storage:
  backend: s3
  local:
    directory: /tmp/cortex/rules
  s3:
    bucket_name: eu-cortex-ruler
    endpoint: url
    access_key_id: "user"
    secret_access_key: "password"
    #insecure: true
    #signature_version: "v2"
    http:
      insecure_skip_verify: true

loki-nodes-rules.yaml

groups:
  - name: loki-nodes
    rules:
    - alert: loki-up
      expr: up{application="loki"} == 1
      labels:
            severity: MAJOR
      annotations:
            description: "Loki is not running on {{ $labels.hostname }}"

These are the logs I received:

level=warn ts=2022-04-15T16:21:04.619726789Z caller=bucket_client.go:147 msg="invalid rule group object key found while listing rule groups" user=fake key=bG9raS1ub2Rlcy1ydWxlcw%3D%3D/ err="illegal base64 data at input byte 22"

Expected behavior The ruler lists the rules without any error.

Environment:

  • Infrastructure: VMs
  • Deployment tool: manual

Storage Engine

  • [X] Blocks
  • [ ] Chunks

Additional Context

PatrikMuniak avatar Apr 15 '22 17:04 PatrikMuniak

What is this bG9raS1ub2Rlcy1ydWxlcw== object?

alanprot avatar Apr 26 '22 18:04 alanprot

What is this bG9raS1ub2Rlcy1ydWxlcw== object?

@alanprot That is the namespace encoded in base64, it corresponds to the filename of the rule file I was trying to upload to cortex. In the bucket that's a folder that contains the rulegroup

PatrikMuniak avatar Apr 26 '22 21:04 PatrikMuniak

Oh Ok..

So for some reason the Hitachi Content Platform is percent-encoding the keys in the response?

bG9raS1ub2Rlcy1ydWxlcw== becomes bG9raS1ub2Rlcy1ydWxlcw%3D%3D

So I guess the question is: why is Hitachi encoding the response?

alanprot avatar Apr 26 '22 21:04 alanprot

@alanprot I checked whether the issue persists when the s3 config is defined inside the ruler: block, and in that case it seems to work. Example:

auth_enabled: true

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug

  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    interface_names: [ens160] 
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  backend: s3
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"

    http:
      insecure_skip_verify: true


compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

  storage:
    type: s3
    s3:

      bucketnames: eu-cortex-ruler
      endpoint: <endpoint>
      access_key_id: "<id>"
      secret_access_key: "<secret>"
      http_config:
        insecure_skip_verify: true

I uploaded the rule with the same cortextool command and it gave me no errors:

level=debug ts=2022-04-27T09:16:00.690149205Z caller=rule_store.go:147 msg="loading rule group" key="rules/nap-tom/bG9raS1ub2Rlcy1ydWxlcw==/bG9raS1ub2Rlcw==" user=nap-tom

If I switch to configuring the s3 bucket in the ruler_storage: block instead, for example:

auth_enabled: true

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    interface_names: [ens160] 
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  backend: s3
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"
    http:
      insecure_skip_verify: true


compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

ruler_storage:
  backend: s3
  local:
    directory: /tmp/cortex/rules
  s3:
    bucket_name: eu-cortex-ruler
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"
    http:
      insecure_skip_verify: true

These are the logs I see:

level=warn ts=2022-04-27T09:33:00.421710256Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/ err="invalid rule group object key"
level=warn ts=2022-04-27T09:33:00.421725842Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/ err="illegal base64 data at input byte 22"
level=warn ts=2022-04-27T09:33:00.421735648Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/bG9raS1ub2Rlcw%3D%3D err="illegal base64 data at input byte 22"

That looks like a Cortex issue.
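
To make the three warnings above concrete, here is a rough sketch of how keys of that shape could be parsed if percent-encoding were tolerated. It is not the Cortex implementation: parseRuleGroupKey and decodeSegment are hypothetical names, and the user/namespace/group layout is assumed from the keys shown in the logs.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

var errInvalidKey = errors.New("invalid rule group object key")

// decodeSegment (hypothetical) percent-decodes a key segment and then
// base64-decodes it with the URL-safe alphabet.
func decodeSegment(segment string) (string, error) {
	unescaped, err := url.PathUnescape(segment)
	if err != nil {
		return "", err
	}
	decoded, err := base64.URLEncoding.DecodeString(unescaped)
	if err != nil {
		return "", err
	}
	return string(decoded), nil
}

// parseRuleGroupKey (hypothetical) splits "<user>/<namespace>/<group>" and
// decodes the last two segments. Directory-style keys such as "nap-tom/"
// are rejected because they do not name a rule group.
func parseRuleGroupKey(key string) (user, namespace, group string, err error) {
	parts := strings.Split(key, "/")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return "", "", "", errInvalidKey
	}
	if namespace, err = decodeSegment(parts[1]); err != nil {
		return "", "", "", err
	}
	if group, err = decodeSegment(parts[2]); err != nil {
		return "", "", "", err
	}
	return parts[0], namespace, group, nil
}

func main() {
	keys := []string{
		"nap-tom/",
		"nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/",
		"nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/bG9raS1ub2Rlcw%3D%3D",
	}
	for _, key := range keys {
		user, namespace, group, err := parseRuleGroupKey(key)
		fmt.Printf("key=%q user=%q namespace=%q group=%q err=%v\n", key, user, namespace, group, err)
	}
}
```

In this sketch only the third key names a rule group; the first two are directory-style entries and are still rejected, just as they are in the logs.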

PatrikMuniak avatar Apr 27 '22 09:04 PatrikMuniak

Hmm, interesting.

In the first case, Cortex uses the AWS SDK to call S3:

https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/ruler/storage.go#L102
https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/chunk/aws/s3_storage_client.go#L382

In the second case, it uses the MinIO client (through the vendored Thanos objstore package):

https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/ruler/storage.go#L119
https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/vendor/github.com/thanos-io/thanos/pkg/objstore/s3/s3.go#L247

I wonder if this explains the difference in behaviour here.
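
One possible explanation, which this thread does not confirm, involves the S3 list API's encoding-type parameter: a client may ask for url-encoded keys, and the response then declares an EncodingType so the client knows to decode them. If a server returned percent-encoded keys without that declaration, a client that decodes only when it is declared would pass the raw %3D through. The sketch below illustrates that conditional decoding with a hypothetical, simplified listResult type. It is not code from either client library.

```go
package main

import (
	"fmt"
	"net/url"
)

// listResult is a hypothetical, simplified stand-in for an S3 list response.
type listResult struct {
	EncodingType string   // "url" when the server declares percent-encoded keys
	Keys         []string // object keys exactly as received
}

// decodeKeys percent-decodes keys only when the response declares
// EncodingType "url"; otherwise the keys are returned untouched.
func decodeKeys(res listResult) ([]string, error) {
	if res.EncodingType != "url" {
		return res.Keys, nil
	}
	decoded := make([]string, 0, len(res.Keys))
	for _, key := range res.Keys {
		k, err := url.QueryUnescape(key)
		if err != nil {
			return nil, fmt.Errorf("percent-decoding %q: %w", key, err)
		}
		decoded = append(decoded, k)
	}
	return decoded, nil
}

func main() {
	keys := []string{"nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/bG9raS1ub2Rlcw%3D%3D"}

	declared, _ := decodeKeys(listResult{EncodingType: "url", Keys: keys})
	undeclared, _ := decodeKeys(listResult{Keys: keys})

	fmt.Println("declared:  ", declared)
	fmt.Println("undeclared:", undeclared)
}
```

Comparing the raw list responses from the HCP endpoint for both clients would be one way to check whether this is what happens here.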

alanprot avatar May 04 '22 22:05 alanprot

Hi all,

I'm struggling with the upload of the YAML file to s3. What is the command that you use to upload the rules to s3? Thanks

alvaropalmeirao avatar Oct 06 '23 13:10 alvaropalmeirao

I found the way to do it: cortextool rules sync --backend=loki --id=fake --rule-files=test1.yml --address=https://<LOKI_ADDRESS>

alvaropalmeirao avatar Oct 06 '23 13:10 alvaropalmeirao