
Loki 3.0 Feedback and Issues

Open slim-bean opened this issue 2 months ago • 71 comments

If you encounter any troubles upgrading to Loki 3.0 or have feedback for the upgrade process, please leave a comment on this issue!

Also you can ask questions at: https://slack.grafana.com/ in the channel #loki-3

Known Issues:

  • https://github.com/grafana/loki/issues/12540: Panic when using blooms; needs 3.0.1, or there is an image in the issue
  • https://github.com/grafana/loki/issues/12554: WARNING: when upgrading the helm chart the ingress needed to be recreated. We will look to see if we can avoid this, but if 404s are sent to the agents they do not retry these logs (this is a separate issue we should change)
  • Helm chart schema_config was renamed to schemaConfig and this is not documented
  • Helm chart issue: https://github.com/grafana/loki/issues/12506#issuecomment-2047231721

slim-bean avatar Apr 08 '24 14:04 slim-bean

pls update grafana.com/docs/loki before releasing a major update; it still shows the 2.9 documentation. :)

Hedius avatar Apr 08 '24 21:04 Hedius

I tried upgrading the Helm chart ( 5.47.2 → 6.0.0 ) but encountered these errors:

❯ k -n observability logs loki-write-1
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 41: field shared_store not found in type compactor.Config
  line 62: field enforce_metric_name not found in type validation.plain. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file
❯ k -n observability logs loki-read-779bd69757-rrdxt
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 41: field shared_store not found in type compactor.Config
  line 62: field enforce_metric_name not found in type validation.plain. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file
❯ k logs -n observability loki-backend-1
Defaulted container "loki-sc-rules" out of: loki-sc-rules, loki
{"time": "2024-04-08T21:37:34.546399+00:00", "msg": "Starting collector", "level": "INFO"}
{"time": "2024-04-08T21:37:34.546577+00:00", "msg": "No folder annotation was provided, defaulting to k8s-sidecar-target-directory", "level": "WARNING"}
{"time": "2024-04-08T21:37:34.546733+00:00", "msg": "Loading incluster config ...", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547477+00:00", "msg": "Config for cluster api at 'https://10.43.0.1:443' loaded...", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547598+00:00", "msg": "Unique filenames will not be enforced.", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547695+00:00", "msg": "5xx response content will not be enabled.", "level": "INFO"}

Pretty sure I adjusted all the breaking changes described in the release notes but maybe some of the custom config I have is not compatible?

My Helm values are located here, any help?

onedr0p avatar Apr 08 '24 21:04 onedr0p

I tried upgrading the Helm chart ( 5.47.2 → 6.0.0 ) but encountered these errors:

  line 41: field shared_store not found in type compactor.Config
  line 62: field enforce_metric_name not found in type validation.plain.

You are setting shared store in compactor. It also got dropped there.

See https://github.com/grafana/loki/blob/main/docs/sources/configure/_index.md#compactor

delete_request_store is now required
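
For reference, a rough sketch of what the compactor block might look like after this change (the delete_request_store value should name your object store; s3 here is just an assumption):

compactor:
  # delete_request_store replaces the removed shared_store
  delete_request_store: s3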

Hedius avatar Apr 08 '24 22:04 Hedius

So I should just be able to rename shared_store to delete_request_store and be good?

onedr0p avatar Apr 08 '24 22:04 onedr0p

helm template grafana/loki --set loki.useTestSchema=true --set-json imagePullSecrets='["blah"]' fails for me with ...executing "loki.memcached.statefulSet" at <$.ctx.Values.image.pullSecrets>: nil pointer evaluating interface {}.pullSecrets

Adding --set-json image.pullSecrets='["blah2"]' to the previous command does work, but image.pullSecrets isn't documented in values.yaml, and would be kind of redundant, so I think maybe this is a typo for imagePullSecrets here?

alto-rlk avatar Apr 08 '24 22:04 alto-rlk

So I should just be able to rename shared_store to delete_request_store and be good?

@onedr0p see https://grafana.com/docs/loki/latest/setup/upgrade/. The shared_store config is removed; refer to "Removed shared_store and shared_store_key_prefix from shipper configuration".

0X11B avatar Apr 09 '24 07:04 0X11B

Since the upgrade everything looks good in our environments, although the backend pods seem to be outputting a lot of level=info ts=2024-04-09T08:01:08.971329289Z caller=gateway.go:241 component=index-gateway msg="chunk filtering is not enabled" with every Loki search. This wasn't happening before 3.0 from what we can tell.

I suspect that's because blooms aren't enabled although when I do enable blooms we get a nil pointer:

level=info ts=2024-04-09T08:17:29.692174397Z caller=bloomcompactor.go:458 component=bloom-compactor msg=compacting org_id=plprod table=index_19820 ownership=1f6c0f8500000000-1fa8b221ffffffff
ts=2024-04-09T08:17:31.535678052Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-backend-3-2e51d875' from=10.30.80.69:7946"
level=info ts=2024-04-09T08:17:31.610784021Z caller=scheduler.go:653 msg="this scheduler is in the ReplicationSet, will now accept requests."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1aec384]

goroutine 1430 [running]:
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func4.1()
	/usr/local/go/src/sync/oncefunc.go:24 +0x7c
panic({0x2002700?, 0x42aae10?})
	/usr/local/go/src/runtime/panic.go:914 +0x218
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.func2()
	/src/loki/pkg/bloomcompactor/controller.go:388 +0x24
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func4()
	/usr/local/go/src/sync/oncefunc.go:27 +0x64
sync.(*Once).doSlow(0x4006e9f128?, 0x0?)
	/usr/local/go/src/sync/once.go:74 +0x100
sync.(*Once).Do(0x400004e800?, 0x21cc060?)
	/usr/local/go/src/sync/once.go:65 +0x24
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func5()
	/usr/local/go/src/sync/oncefunc.go:31 +0x34
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps(0x4006e7e720, {0x2c70e48, 0x4006e6e7d0}, {0x4006867892, 0x6}, {{0x1f6c0f8500000000?}, {0x40005a0578?, 0x4d6c?}}, {0x4321220?, 0x0?}, ...)
	/src/loki/pkg/bloomcompactor/controller.go:396 +0x133c
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).compactTenant(0x4006e7e720, {0x2c70e48, 0x4006e6e7d0}, {{0x2?}, {0x40005a0578?, 0x101000000226f98?}}, {0x4006867892, 0x6}, {0x2?, 0x0?}, ...)
	/src/loki/pkg/bloomcompactor/controller.go:115 +0x6a0
github.com/grafana/loki/v3/pkg/bloomcompactor.(*Compactor).compactTenantTable(0x40007eee00, {0x2c70e48, 0x4006e6e7d0}, 0x4001a7eab0, 0x0?)
	/src/loki/pkg/bloomcompactor/bloomcompactor.go:460 +0x2e8
github.com/grafana/loki/v3/pkg/bloomcompactor.(*Compactor).runWorkers.func2({0x2c70e48, 0x4006e6e7d0}, 0x0?)
	/src/loki/pkg/bloomcompactor/bloomcompactor.go:422 +0xe0
github.com/grafana/dskit/concurrency.ForEachJob.func1()
	/src/loki/vendor/github.com/grafana/dskit/concurrency/runner.go:105 +0xbc
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/src/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:78 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1428
	/src/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x98

rknightion avatar Apr 09 '24 08:04 rknightion

When upgrading, the pod from the new stateful set 'loki-chunks-cache' couldn't be scheduled, because none of our nodes offer the requested 9830 MiB of memory.

nomaster avatar Apr 09 '24 08:04 nomaster

pls update grafana.com/docs/loki before releasing a major update; it still shows the 2.9 documentation. :)

very sorry about this, we are working on a new release process and also had problems with our documentation updates. I think there are still a few things we are working out, but hopefully most of it is correct now.

slim-bean avatar Apr 09 '24 14:04 slim-bean

When upgrading, the pod from the new stateful set 'loki-chunks-cache' couldn't be scheduled, because none of our nodes offer the requested 9830 MiB of memory.

You could disable this external memcached entirely by setting enabled: false

Or you can make it smaller by reducing allocatedMemory; this will also automatically adjust the pod requests in k8s!

chunksCache:
  # -- Specifies whether memcached based chunks-cache should be enabled
  enabled: true
  # -- Amount of memory allocated to chunks-cache for object storage (in MB).
  allocatedMemory: 8192
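
For example (numbers are illustrative), shrinking the cache so the pod fits on smaller nodes:

chunksCache:
  enabled: true
  # the memcached pod's memory request is derived from this value
  allocatedMemory: 2048

or skipping the external memcached entirely:

chunksCache:
  enabled: false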

slim-bean avatar Apr 09 '24 14:04 slim-bean

Awesome with the new bloom filter, for unique IDs etc! 🎉

I'm looking forward to close issue https://github.com/grafana/loki/issues/91 (from 2018) when the experimental bloom filters are stable. 😄

Regarding docs, some feedback:

  • Would be nice to rewrite the 'Simple Scalable' to not assume Kubernetes. For example, move sentences such as "The write target is stateful and is controlled by a Kubernetes StatefulSet." into a separate sub-heading, named kubernetes details. That way, the general description of the simple scalable deployment mode doesn't need to dig into details on how to deploy it under kubernetes.
  • Clarify why the write-target and backend-targets are stateful. I thought any state was on S3 or in configuration files. Is this 'state' the WAL, or cached chunks on disk before being flushed to S3 (or other object storage)? If so, maybe clarify this.
  • Update the architecture section and skip any mention of BoltDB and other legacy stuff, it's just confusing. Include only information related to how it's operated under a regular 3.0 deployment (you can still keep old 1.x docs about BoltDB, just remove it from 3.x docs).
    • Same in https://grafana.com/docs/loki/latest/configure/storage/, don't mention any deprecated/old stuff.
  • More of a feature request, but rename "fake" to "default", it's confusing: https://grafana.com/docs/loki/latest/get-started/architecture/#multi-tenancy
  • Update the docs to reflect 3.0, currently it says "For release 2.9 the components are:…"
  • Update https://grafana.com/docs/loki/latest/operations/storage/retention/ and explain how to use life-cycle rules on S3 (or similar) to handle retention. Remove legacy stuff here too.

Source: https://grafana.com/docs/loki/latest/get-started/deployment-modes/

sandstrom avatar Apr 09 '24 17:04 sandstrom

Trying to update the helm chart from 5.43.2 to 6.1.0 but I am getting:

UPGRADE FAILED: template: loki/templates/single-binary/statefulset.yaml:44:28: executing "loki/templates/single-binary/statefulset.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "
<<<< template removed for brevity >>>
": template: loki/templates/single-binary/statefulset.yaml:37:6: executing "loki/templates/single-binary/statefulset.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

K1kc4 avatar Apr 10 '24 08:04 K1kc4

For the loki helm chart: https://github.com/grafana/loki/pull/12067 changed the port name for the gateway service from http to http-metrics which caused it to be picked up by the loki ServiceMonitor.

The gateway responds with a 404 on the /metrics path causing the prometheus target to fail.

AllexVeldman avatar Apr 10 '24 08:04 AllexVeldman

For the loki chart we unfortunately had to face some downtime.

This change (https://github.com/grafana/loki/commit/79b876b65d55c54f4d532e98dc24743dea8bedec#diff-89f4fd98934eb0f277b921d45e4c223e168490c44604e454a2192d28dab1c3e2R4) forced the recreation of all the gateway resources: Deployment, Service, PodDisruptionBudget and, most critically, Ingress.

This is problematic for 2 reasons:

  • The deployment and service will immediately get traffic even though the pods are literally starting and most likely still in the ImagePull phase.
  • Replacing an ingress with the exact same hostname and path combination is problematic if you are running nginx ingress, as is the case for a really good chunk of the community. This is in part because of their strict validating webhook that doesn't allow duplicate ingresses of that type. The only solution was to delete the ingress and quickly sync it, causing some downtime. Unfortunately promtail wasn't able to recover and send the accumulated log data. This is because it doesn't retry on 404 errors that happen if the ingress is deleted.

tete17 avatar Apr 10 '24 09:04 tete17

Two issues so far with my existing Helm values:

loki.schema_config apparently became loki.schemaConfig. After renaming the object, that part was accepted (also by the 5.x helm chart).

Then the loki ConfigMap failed to be generated. The config.yaml value is literally Error: 'error converting YAML to JSON: yaml: line 70: mapping values are not allowed in this context'.

Trying to render the helm chart locally with "helm --debug template" results in

Error: template: loki/templates/write/statefulset-write.yaml:46:28: executing "loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.ya
ml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "
<<<< template removed for brevity >>>
": template: loki/templates/write/statefulset-write.yaml:37:6: executing "loki/templates/write/statefulset-write.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

I try to understand the nested template structure in the helm chart to understand what is happening.

A short helm chart values set (which worked fine with 5.x) triggering the phenomenon:

values.yaml
serviceAccount:
  create: false
  name: loki
test:
  enabled: false
monitoring:
  dashboards:
    enable: false
  lokiCanary:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
loki:
  auth_enabled: false
  limits_config:
    max_streams_per_user: 10000
    max_global_streams_per_user: 10000
  storage_config:
    aws:
      s3: s3://eu-central-1
      bucketnames: my-bucket-name
  schemaConfig:
    configs:
      - from: 2024-01-19
        store: tsdb
        object_store: aws
        schema: v11
        index:
          prefix: "some-prefix_"
          period: 24h
  query_range:
    split_queries_by_interval: 0
  query_scheduler:
    max_outstanding_requests_per_tenant: 8192
  analytics:
    reporting_enabled: false
  compactor:
    shared_store: s3
gateway:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3
compactor:
  enable: true
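
A rough sketch of how the storage and compactor parts above might need to be restated for chart v6, judging by the template path in the error ($.Values.loki.storage.bucketNames.chunks) and the shared_store removal discussed above (key names and bucket/region values here are assumptions, not verified):

loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-bucket-name
      ruler: my-bucket-name
      admin: my-bucket-name
    s3:
      region: eu-central-1
  compactor:
    # shared_store was removed; delete_request_store names the object store
    delete_request_store: s3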

MartinEmrich avatar Apr 10 '24 11:04 MartinEmrich

hahaha [image]

I thought I recognized that github picture!!!

I'm looking forward to close issue https://github.com/grafana/loki/issues/91 (from 2018) when the experimental bloom filters are stable. 😄

2018!!!

Thanks for the great feedback on the docs, very helpful.

One note regarding SSD mode: honestly, the original idea of SSD was to make Loki a lot more friendly outside of k8s environments. The problem we found ourselves in, though, is that we have had no good ability to support customers attempting to run Loki this way, and as such we largely require folks to use Kubernetes for our commercial offering. This is why the docs are so k8s specific.

It continues to be a struggle to build an open source project which is extremely flexible for folks to run in many ways, but also a product that we have to provide support for.

I'd love to know though how many folks are successfully running SSD mode outside of kubernetes. I'm still a bit bullish on the idea but over time I kind of feel like it hasn't played out as well as we hoped.

slim-bean avatar Apr 10 '24 11:04 slim-bean

For the loki helm chart: #12067 changed the port name for the gateway service from http to http-metrics which caused it to be picked up by the loki ServiceMonitor.

The gateway responds with a 404 on the /metrics path causing the prometheus target to fail.

oh interesting, we'll take a look at this, not sure what happened here, thanks!

slim-bean avatar Apr 10 '24 11:04 slim-bean

@tete17 I created a new issue for what you found https://github.com/grafana/loki/issues/12554

Thank you for reporting, sorry for the troubles :(

slim-bean avatar Apr 10 '24 11:04 slim-bean

@MartinEmrich thank you, I will update the upgrade guide around schemaConfig, sorry about that. And thank you for the sample test values file! very helpful!

slim-bean avatar Apr 10 '24 12:04 slim-bean

Congratulations on the release! :tada: :) Is there any way to verify that bloom filters are active and working? I cannot seem to find any metrics or log entries that might give a hint. There are also no bloom services listed on the /services endpoint:

curl -s -k https://localhost:3100/services
ruler => Running
compactor => Running
store => Running
ingester-querier => Running
query-scheduler => Running
ingester => Running
query-frontend => Running
distributor => Running
server => Running
ring => Running
query-frontend-tripperware => Running
analytics => Running
query-scheduler-ring => Running
querier => Running
cache-generation-loader => Running
memberlist-kv => Running

I tried deploying it on a single instance in monolithic mode via Docker by adding the following options:


limits_config:
  bloom_gateway_enable_filtering: true
  bloom_compactor_enable_compaction: true

bloom_compactor:
  enabled: true
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

bloom_gateway:
  enabled: true
  client:
    addresses: dns+localhost.localdomain:9095

Edit: My bad, it seems that the bloom components are not available when using -target=all. It needs to be set to -target=all,bloom-compactor,bloom-gateway,bloom-store for a monolithic deployment I guess? See https://grafana.com/docs/loki/latest/get-started/components/#loki-components.
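
If it helps others experimenting, here is a sketch of how that target list could be passed for a single-container monolithic deployment via docker-compose (image tag, paths and the exact set of bloom targets mirror the guess above and are not verified):

services:
  loki:
    image: grafana/loki:3.0.0
    command:
      - -config.file=/etc/loki/local-config.yaml
      - -target=all,bloom-compactor,bloom-gateway,bloom-store
    ports:
      - "3100:3100"
    volumes:
      # mount the config containing the limits_config / bloom_* blocks shown above
      - ./loki-config.yaml:/etc/loki/local-config.yaml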

MarcBrendel avatar Apr 10 '24 14:04 MarcBrendel

not sure if this is intended but in the _helpers.tpl there is an if check which might be wrong:

{{- if "loki.deployment.isDistributed "}}

similar check is done here which looks like this:

{{- $isDistributed := eq (include "loki.deployment.isDistributed" .) "true" -}}
{{- if $isDistributed -}}

This causes the if check to always be true (it tests a non-empty string literal rather than the result of the include), and thus the frontend.tail_proxy_url is set in the loki config. But the configured tail_proxy_url does not point to an existing service (I used SSD deployment mode). Not sure if this has any impact.

dakr0013 avatar Apr 10 '24 14:04 dakr0013

We encountered a bug in the rendering of the Loki config with the helm chart v6.0.0 that may be similar to what @MartinEmrich encountered above. These simple values will cause the rendering to fail:

loki:
  query_range:
    parallelise_shardable_queries: false
  useTestSchema: true

This causes .Values.loki.config to look like (note the extra indent):

query_range:
  align_queries_with_step: true
    parallelise_shardable_queries: false
  cache_results: true

I believe anything under loki.query_range is being misindented here.
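
For comparison, the rendering that was presumably intended (same keys, correctly indented):

query_range:
  align_queries_with_step: true
  parallelise_shardable_queries: false
  cache_results: true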

EDIT: I've added a PR to solve the above but in general we've had trouble upgrading to Helm chart v6 as there are now two fields which are seemingly necessary where before they were not, and they're not listed in the upgrade guide:

  • As of 6.0: we must provide a schemaConfig whereas in v5 we could use a suggested default without needing a useTestSchema flag.
  • As of 6.1: we must provide storage defaults otherwise templating fails (see this comment).

In general I would personally prefer that I can always install a Helm chart with no values and get some kind of sensible default, even if only for testing out the chart. Later, when I want to go production-ready, I can tweak those parameters to something more appropriate.

coro avatar Apr 11 '24 14:04 coro

On the upgrade attempt using Simple Scalable mode, scheduler_address is empty in the rendered config, whilst it was present before the upgrade:

    frontend:
      scheduler_address: ""
      tail_proxy_url: http://loki-querier.grafana.svc.gke-main-a.us-east1:3100
    frontend_worker:
      scheduler_address: ""

It looks like schedulerAddress is defined only for the Distributed mode; note that the query-scheduler-discovery service is still created.

maksym-iv avatar Apr 11 '24 20:04 maksym-iv

We encountered a bug in the rendering of the Loki config with the helm chart v6.0.0 that may be similar to what @MartinEmrich encountered above. [...]

EDIT: I've added a PR to solve the above but in general we've had trouble upgrading to Helm chart v6 as there are now two fields which are seemingly necessary where before they were not, and they're not listed in the upgrade guide:

* As of 6.0: we must provide a `schemaConfig` whereas in v5 we could use a suggested default without needing a `useTestSchema` flag.

* As of 6.1: we must provide storage defaults otherwise templating fails (see [this comment](https://github.com/grafana/loki/pull/12548#issuecomment-2046492619)).

In general I would personally prefer that I can always install a Helm chart with no values and get some kind of sensible default, even if only for testing out the chart. Later, when I want to go production-ready, I can tweak those parameters to something more appropriate.

Very helpful feedback, thank you!

The schemaConfig name change was an oversight on my part and I need to get it into the upgrade guide, apologies.

The forced requirement for a schemaConfig is an interesting problem: if we default it in the chart then people end up using it, which means we can't change it without breaking their clusters, because schemas can't be changed, only new ones added. I do suppose we could just add new ones but that feels a bit like forcing an upgrade on someone... I'm not sure, this is a hard problem that I don't have great answers to.

We decided that this time around we'd force people to define a schema, and provide the test schema config value that should be spit out in an error message if you want to just try the chart with data you plan on throwing away. It does seem like we need to update this error or that flag to also provide values for the storage defaults however.

slim-bean avatar Apr 11 '24 23:04 slim-bean

On the upgrade attempt using Simple Scalable mode scheduler_address is empty in the rendered config, whilst present before upgrade:

    frontend:
      scheduler_address: ""
      tail_proxy_url: http://loki-querier.grafana.svc.gke-main-a.us-east1:3100
    frontend_worker:
      scheduler_address: ""

It looks like schedulerAddress is defined only for the Distributed mode, note, service query-scheduler-discovery is still created

Good eye, and interestingly this is to be expected. In SSD mode Loki can resolve the scheduler addresses using the same mechanism we use to communicate an ownership hash ring via memberlist for other things in Loki. Setting the scheduler address to an empty string enables this behavior by default.

It used to work this way a long time ago, but there was unfortunately a bug released in a version of Loki which broke it, and a workaround was to set the scheduler_address in helm. The bug was fixed long ago, so I returned this to the preferred behavior of letting Loki figure this out itself.

slim-bean avatar Apr 11 '24 23:04 slim-bean

@slim-bean

Good eye, and interestingly this is to be expected. [...]

Thx for the clarification, that makes sense.

So, initially, on upgrade attempt I received the errors from read/write like:

level=error ts=2024-04-11T19:54:26.700159188Z caller=ring_watcher.go:56 component=querier component=querier-scheduler-worker msg="error getting addresses from ring" err="empty ring"

After more attempts with an empty scheduler_address the following has worked:

  1. Set the following config under the loki key in the helm chart (basically set the previous query-scheduler-discovery address for scheduler_address) and apply:

query_scheduler:
  use_scheduler_ring: true
frontend:
  scheduler_address: query-scheduler-discovery.grafana.svc.some-gke.us-east1.:9095
frontend_worker:
  scheduler_address: query-scheduler-discovery.grafana.svc.some-gke.us-east1.:9095

  2. Wait for a healthy state, remove the scheduler_address, and apply:

query_scheduler:
  use_scheduler_ring: true

  3. Wait for a healthy state, remove the use_scheduler_ring: true, and apply.

However there is a high chance I've not waited long enough for the backend to roll out, and maybe if it had rolled out the error would just have gone away eventually.

So I wonder if the upgrade path I've used looks correct

maksym-iv avatar Apr 12 '24 09:04 maksym-iv

@MartinEmrich thank you, I will update the upgrade guide around schemaConfig, sorry about that. And thank you for the sample test values file! very helpful!

@slim-bean Today I noticed that it actually made a difference: apparently the earlier "schema_config" only half worked. While store:, object_store: etc. seem to have worked before (it did use AWS all the time and never complained), the index.prefix was only effective after switching to schemaConfig. In effect my Loki now only sees the logs from three days ago onwards.

(And being new to Loki, I did not look at the object structure or the file contents to validate that my custom index prefix was effective).

Can I make the older logs still searchable somehow? (Worst case: renaming AWS S3 objects systematically.) A new "schema" block with a "from" date probably won't fit, as it only allows a date, not a complete datetime....

MartinEmrich avatar Apr 12 '24 10:04 MartinEmrich

One question about 3.0 release: is there an eta on the docker images for it?

I'm way excited about the otel support but I currently run my loki from docker hub and was kind of confused to not see the new version there!

I'm sorry if this isn't the right place to ask

lsunsi avatar Apr 13 '24 16:04 lsunsi

One question about 3.0 release: is there an eta on the docker images for it?

I'm way excited about the otel support but I currently run my loki from docker hub and was kind of confused to not see the new version there!

I'm sorry if this isn't the right place to ask

The latest tag is 3.0 and there is a 3.0.0 tagged image?

Hedius avatar Apr 13 '24 17:04 Hedius


The latest tag is 3.0 and there is a 3.0.0 tagged image?

I'm so sorry, you're right, I got confused due to docker hub ordering and probably the voices in my head.

Is it better to delete my question to avoid further confusion, or just leave it?

lsunsi avatar Apr 13 '24 17:04 lsunsi