
snuba-subscription-consumer-* containers are failing continuously

Open sree-warrier opened this issue 1 year ago • 21 comments

Self-Hosted Version

23.11.2

CPU Architecture

x86_64

Docker Version

NA

Docker Compose Version

NA

Steps to Reproduce

The following containers have been crashing continuously. Are these services used for alerting? We are a little confused about what these services do.

snuba-subscription-consumer-events
snuba-subscription-consumer-metrics
snuba-subscription-consumer-transactions

Logs:

2024-07-20 15:57:07,088 Initializing Snuba...
2024-07-20 15:57:10,884 Snuba initialization took 3.7952772620010364s
{"module": "builtins", "event": "Checking Clickhouse connections", "severity": "info", "timestamp": "2024-07-20T15:57:10.897290Z"}
2024-07-20 15:57:10,966 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=0): 0, Partition(topic=Topic(name='snuba-commit-log'), index=1): 0, Partition(topic=Topic(name='snuba-commit-log'), index=2): 0, Partition(topic=Topic(name='snuba-commit-log'), index=3): 0, Partition(topic=Topic(name='snuba-commit-log'), index=4): 0}
2024-07-20 15:57:10,979 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2024-07-20 15:57:10,981 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7cba41de5f70>...
2024-07-20 15:57:10,983 Partitions to revoke: [Partition(topic=Topic(name='snuba-commit-log'), index=0), Partition(topic=Topic(name='snuba-commit-log'), index=1), Partition(topic=Topic(name='snuba-commit-log'), index=2), Partition(topic=Topic(name='snuba-commit-log'), index=3), Partition(topic=Topic(name='snuba-commit-log'), index=4)]
2024-07-20 15:57:10,983 Partition revocation complete.
2024-07-20 15:57:10,987 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 153, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2

The alerting system was working fine. We made a few changes to the Kafka partitions, and after that only these 3 containers went down.

  • Initially increased the Kafka partition count for ingest-events and events from 1 to 5 for scale testing
  • Saw only these 3 services going down with the above error
  • Followed the steps described in this issue: https://github.com/getsentry/self-hosted/issues/2067
  • Tried to clear all lag and offsets, which still didn't work
  • Recreated the topics as per the solution mentioned in the above issue, which didn't work either
  • Deleted all existing alerts and recreated them; alerts are now working, but the containers are still in a failed state

We are a little confused about what these services do. Which service is now serving the alerting?

We suspect a partition mismatch (please correct us if this is unrelated), so we have increased the partition count of all topics to 5. Reviewing all topic configs now, these 3 topics (snuba-commit-log, events-subscription-results and ingest-monitors) have a ReplicationFactor of 3, while all other topics have a ReplicationFactor of 1. All remaining configs are the same.
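
For readers following along: the KeyError in the traceback above is consistent with a lookup keyed by partition index hitting a partition that no scheduler was created for. The sketch below only illustrates that failure mode and is not Snuba's actual code. `Tick`, `FakeScheduler`, `build_schedulers` and `submit` are hypothetical stand-ins, and the idea that the scheduler map is sized from a configured partition count is an assumption.

```python
from typing import Dict, List, NamedTuple


class Tick(NamedTuple):
    """Hypothetical stand-in for a commit-log tick; only the partition index matters."""

    partition: int


class FakeScheduler:
    """Hypothetical per-partition scheduler that returns one dummy task per tick."""

    def find(self, tick: Tick) -> List[str]:
        return [f"task-for-partition-{tick.partition}"]


def build_schedulers(configured_partitions: int) -> Dict[int, FakeScheduler]:
    # Assumption: one scheduler per partition index the service believes exists.
    return {index: FakeScheduler() for index in range(configured_partitions)}


def submit(schedulers: Dict[int, FakeScheduler], tick: Tick) -> List[str]:
    # Same shape as the failing line in the traceback: index by tick.partition.
    return list(schedulers[tick.partition].find(tick))


if __name__ == "__main__":
    schedulers = build_schedulers(configured_partitions=1)
    print(submit(schedulers, Tick(partition=0)))
    try:
        # A tick from a partition index that has no scheduler entry.
        submit(schedulers, Tick(partition=2))
    except KeyError as exc:
        print(f"KeyError: {exc}")
```

In this sketch, a tick from partition 0 is served normally, while a tick from partition 2 raises `KeyError` because the map only has an entry for index 0.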

Also, when listing consumer groups, the following show no active members:

Consumer group 'snuba-transactions-subscriptions-consumers' has no active members.
Consumer group 'snuba-events-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-6e1d91f6451a11ef8ad962551908ad8e' has no active members.
Consumer group 'nuba-metrics-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-12e82a30451a11efb933c2a760684d4c' has no active members.

Do let us know if any other information is needed.

Expected Result

NA

Actual Result

NA

Event ID

No response

sree-warrier avatar Jul 20 '24 16:07 sree-warrier

I think this may be a duplicate of https://github.com/getsentry/snuba/issues/5855#issuecomment-2113149084

untitaker avatar Dec 10 '24 19:12 untitaker

Any updates on this? I was able to fix snuba-subscription-consumer-transactions by recreating the corresponding topic, but that did not work for snuba-subscription-consumer-metrics.

chipzzz avatar Dec 31 '24 19:12 chipzzz

@mcannizz

chipzzz avatar Dec 31 '24 19:12 chipzzz

Is this simply because Sentry is currently not polling this data as it's not needed, hence Kafka reporting no active members? The crash loop should still be fixed.

UPDATE: However, once I reset the offset (or whenever the pod is not crashing), the group does have active members and the problem persists.

chipzzz avatar Jan 02 '25 21:01 chipzzz

Please refer to the issue comment I linked above and ensure you are not running the commit-log topic with more than one partition.
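
A quick way to check this is sketched below. It assumes the partition counts were collected by hand (for example from the output of `kafka-topics.sh --describe`) into a plain dict; the helper name `multi_partition_commit_logs` is hypothetical, and matching commit-log topics by the `commit-log` name suffix is an assumption based on the topic names in this thread.

```python
from typing import Dict, List


def multi_partition_commit_logs(partition_counts: Dict[str, int]) -> List[str]:
    """Return commit-log topics (matched by name suffix) with more than one partition."""
    return sorted(
        topic
        for topic, count in partition_counts.items()
        if topic.endswith("commit-log") and count > 1
    )


if __name__ == "__main__":
    # Example counts entered by hand; replace with the values from your cluster.
    observed = {
        "snuba-commit-log": 5,
        "snuba-transactions-commit-log": 1,
        "events": 5,
    }
    for topic in multi_partition_commit_logs(observed):
        print(f"{topic} has more than one partition")
```

Only topics whose name ends in `commit-log` are considered, so the `events` topic in the example is ignored even though it has 5 partitions.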

untitaker avatar Jan 03 '25 16:01 untitaker

I have had all commit-log topics set to 1 partition and 1 replica all along, and they are all defined in topic_partition_counts and set to 1 as well. This problem became apparent when I was doing some regular operations like editing topic_partition_counts and updating basic configs.

I can't get past this: events is fine, metrics and transactions are not.

seborys40 avatar Jan 03 '25 18:01 seborys40

Sorry, I was logged in with another account. Correction: events is fine, transactions is fine, but metrics is not.

chipzzz avatar Jan 03 '25 18:01 chipzzz

@untitaker, is this at all related to the removal of the beta metrics feature from Sentry? We recently removed the metrics beta feature. I wonder if some components can now be removed from the deployment, although I see the metrics topics are still having data sent to them.

What exactly are these responsible for?

  • sentry-snuba-subscription-consumer-metrics
  • sentry-snuba-subscription-consumer-transactions
  • sentry-snuba-subscription-consumer-events

chipzzz avatar Jan 03 '25 18:01 chipzzz

@chipzzz metrics is for release health (crashed sessions etc in releases tab), generic-metrics is for the beta metrics feature you mention, transactions is for Performance product in general, events is for errors

generally, deployments with "subscription" in the name are for alerts. if you don't need alerts on crashed sessions/performance data/errors respectively, you can just remove those deployments

if you have further questions like this I suggest filing a separate issue from this one, which should be focused on the bugs IMO

untitaker avatar Jan 03 '25 18:01 untitaker

@untitaker, I am still using all of the aforementioned, except beta metrics. It's unclear, though, what else could be causing this.

chipzzz avatar Jan 03 '25 18:01 chipzzz

This may be related to https://github.com/getsentry/self-hosted/pull/3106#issue-2332027811

chipzzz avatar Jan 07 '25 19:01 chipzzz

For reference, also including https://github.com/getsentry/snuba/pull/2666

chipzzz avatar Jan 07 '25 19:01 chipzzz

This was removed in https://github.com/getsentry/snuba/pull/3623 but is still referenced here: https://github.com/getsentry/snuba/blob/24.7.1/snuba/subscriptions/scheduler_processing_strategy.py#L210

chipzzz avatar Jan 07 '25 20:01 chipzzz

Resolved the issue.

These consumers

  • sentry-snuba-subscription-consumer-metrics
  • sentry-snuba-subscription-consumer-transactions
  • sentry-snuba-subscription-consumer-events

also depend on other topics, not just the commit-log topics. These are:

  • snuba-metrics
  • transactions
  • events

In my UAT environment I had an increased number of partitions for these topics but did not have a matching number of consumers to consume from all partitions, hence the KeyError.

So the number of partitions must match the number of consumers consuming them. I had tested with other topics/consumers but was not aware that the snuba-metrics, transactions and events topics were also associated.

However, I am not sure how this problem became apparent, as it was always set up this way.

chipzzz avatar Jan 08 '25 16:01 chipzzz

So the number of partitions must match the number of consumers consuming them. I had tested with other topics/consumers but was not aware that the snuba-metrics, transactions and events topics were also associated.

@chipzzz can you explain what you mean?

I have 1 partition with 1 replica for the "snuba-transactions-commit-log" and "transactions" topics, but the "sentry-snuba-subscription-consumer-transactions" worker is still crashing:

  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 153, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 322, in run
    self._run_once()
  File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 410, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 252, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 8

tuchinsky avatar Feb 08 '25 18:02 tuchinsky

@tuchinsky, make sure you also define the partition count in the Snuba config so it's aware of it.

https://stackoverflow.com/questions/73294110/keyerror-1-sentry-snuba-subscription-consumer-events

But essentially yes, the partition count must match the consumer count. Given that your KeyError is 8 (a zero-based partition index), look for topics that have more than 8 partitions, or for places where you run 8 or more consumers. So verify the topics, verify the Snuba config, and verify the consumer count, and make sure all the counts align.
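
The comparison between the Snuba config and the real topics can be sketched as follows. This is a minimal helper, not part of Snuba: `find_partition_mismatches` is a hypothetical name, the `actual` dict is assumed to be filled in by hand from the broker (for example from `kafka-topics.sh --describe`), and treating topics missing from the config as having 1 partition is an assumption.

```python
from typing import Dict, List, Tuple


def find_partition_mismatches(
    configured: Dict[str, int],
    actual: Dict[str, int],
    default: int = 1,  # Assumption: topics absent from the config count as 1 partition.
) -> List[Tuple[str, int, int]]:
    """Return (topic, configured_count, actual_count) for every topic whose counts differ."""
    mismatches = []
    for topic in sorted(actual):
        expected = configured.get(topic, default)
        if expected != actual[topic]:
            mismatches.append((topic, expected, actual[topic]))
    return mismatches


if __name__ == "__main__":
    # Values as they would appear in TOPIC_PARTITION_COUNTS (example only).
    configured = {"transactions": 1, "snuba-transactions-commit-log": 1}
    # Values as reported by the broker (example only, entered by hand).
    actual = {"transactions": 1, "snuba-transactions-commit-log": 9, "events": 1}
    for topic, expected, found in find_partition_mismatches(configured, actual):
        print(f"{topic}: config says {expected}, broker has {found}")
```

With the example values, only `snuba-transactions-commit-log` is reported, since a KeyError of 8 would require a topic with at least 9 partitions somewhere.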

seborys40 avatar Feb 08 '25 19:02 seborys40

@seborys40 I didn't have this parameter in the config. I checked the topic configuration in Kafka and then added it to the Snuba settings, but nothing changed: only the sentry-snuba-subscription-consumer-transactions worker is still crashing, with the same error.

TOPIC_PARTITION_COUNTS = {
    "buffered-segments": 20,
    "buffered-segments-dlq": 20,
    "cdc": 1,
    "event-replacements": 1,
    "events": 1,
    "events-subscription-results": 1,
    "generic-events": 1,
    "generic-metrics-subscription-results": 1,
    "group-attributes": 1,
    "ingest-attachments": 20,
    "ingest-attachments-dlq": 20,
    "ingest-events": 20,
    "ingest-events-dlq": 20,
    "ingest-feedback-events": 20,
    "ingest-feedback-events-dlq": 20,
    "ingest-generic-metrics-dlq": 20,
    "ingest-metrics": 20,
    "ingest-metrics-dlq": 20,
    "ingest-monitors": 20,
    "ingest-occurrences": 20,
    "ingest-performance-metrics": 50,
    "ingest-replay-events": 1,
    "ingest-replay-recordings": 20,
    "ingest-transactions": 20,
    "ingest-transactions-dlq": 20,
    "metrics-subscription-results": 1,
    "monitors-clock-tasks": 20,
    "monitors-clock-tick": 20,
    "outcomes": 1,
    "outcomes-billing": 20,
    "processed-profiles": 1,
    "profiles": 20,
    "profiles-call-tree": 1,
    "scheduled-subscriptions-events": 1,
    "scheduled-subscriptions-generic-metrics-counters": 1,
    "scheduled-subscriptions-generic-metrics-distributions": 1,
    "scheduled-subscriptions-generic-metrics-gauges": 1,
    "scheduled-subscriptions-generic-metrics-sets": 1,
    "scheduled-subscriptions-metrics": 1,
    "scheduled-subscriptions-transactions": 1,
    "shared-resources-usage": 1,
    "snuba-commit-log": 1,
    "snuba-dead-letter-generic-events": 1,
    "snuba-dead-letter-generic-metrics": 1,
    "snuba-dead-letter-group-attributes": 1,
    "snuba-dead-letter-metrics": 1,
    "snuba-dead-letter-querylog": 1,
    "snuba-dead-letter-replays": 1,
    "snuba-generic-events-commit-log": 1,
    "snuba-generic-metrics": 1,
    "snuba-generic-metrics-counters-commit-log": 1,
    "snuba-generic-metrics-distributions-commit-log": 1,
    "snuba-generic-metrics-gauges-commit-log": 1,
    "snuba-generic-metrics-sets-commit-log": 1,
    "snuba-metrics": 1,
    "snuba-metrics-commit-log": 1,
    "snuba-metrics-summaries": 1,
    "snuba-profile-chunks": 1,
    "snuba-queries": 1,
    "snuba-spans": 1,
    "snuba-transactions-commit-log": 1,
    "transactions": 1,
    "transactions-subscription-results": 1,
    "uptime-configs": 20,
    "uptime-results": 20
}

My container replica counts:

sentry-billing-metrics-consumer 1
sentry-cron 1
sentry-generic-metrics-consumer 5
sentry-ingest-consumer-attachments 5
sentry-ingest-consumer-events 5
sentry-ingest-consumer-transactions 5
sentry-ingest-monitors 3
sentry-ingest-occurrences 3
sentry-ingest-profiles 5
sentry-ingest-replay-recordings 3
sentry-metrics 1
sentry-metrics-consumer 3
sentry-post-process-forward-errors 1
sentry-post-process-forward-issue-platform 1
sentry-post-process-forward-transactions 1
sentry-relay 2
sentry-snuba-api 3
sentry-snuba-consumer 1
sentry-snuba-generic-metrics-counters-consumer 1
sentry-snuba-generic-metrics-distributions-consumer 1
sentry-snuba-generic-metrics-sets-consumer 1
sentry-snuba-group-attributes-consumer 1
sentry-snuba-issue-occurrence-consumer 1
sentry-snuba-metrics-consumer 1
sentry-snuba-outcomes-billing-consumer 5
sentry-snuba-outcomes-consumer 1
sentry-snuba-profiling-functions-consumer 1
sentry-snuba-profiling-profiles-consumer 1
sentry-snuba-replacer 1
sentry-snuba-replays-consumer 1
sentry-snuba-spans-consumer 1
sentry-snuba-subscription-consumer-events 1
sentry-snuba-subscription-consumer-metrics 1
sentry-snuba-subscription-consumer-transactions 1
sentry-snuba-transactions-consumer 1
sentry-subscription-consumer-events 1
sentry-subscription-consumer-generic-metrics 1
sentry-subscription-consumer-metrics 1
sentry-subscription-consumer-transactions 1
sentry-symbolicator-api 1
sentry-vroom 2
sentry-web 3
sentry-worker 5
sentry-worker-events 3
sentry-worker-transactions 5

tuchinsky avatar Feb 08 '25 19:02 tuchinsky

It will be one of your topics with more than 1 partition. One of those topics either doesn't match the partition count in the Snuba config or has more or fewer partitions than consumers.

Also, the monitors-clock-tasks and monitors-clock-tick topics should be kept at 1, and the same goes for their associated consumers. At least that's what I saw from the bug when it was implemented; I'm not sure what version you're on or whether it has been fixed.

seborys40 avatar Feb 08 '25 20:02 seborys40

FYI, your replica count should be aligned with the counts in TOPIC_PARTITION_COUNTS and also with the actual topic partition counts.
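
A rough way to apply that suggestion is sketched below. The helper `find_replica_mismatches` is hypothetical, and so is the deployment-to-topic mapping: it has to be filled in from your own deployment, and the two entries shown are assumptions for illustration only. A reported difference is just a pointer for closer inspection, not proof of a misconfiguration.

```python
from typing import Dict, List, Tuple


def find_replica_mismatches(
    replicas: Dict[str, int],
    deployment_topics: Dict[str, str],
    partition_counts: Dict[str, int],
) -> List[Tuple[str, str, int, int]]:
    """Return (deployment, topic, replica_count, partition_count) where the counts differ."""
    mismatches = []
    for deployment, topic in sorted(deployment_topics.items()):
        replica_count = replicas.get(deployment)
        partition_count = partition_counts.get(topic)
        if replica_count is None or partition_count is None:
            # Skip entries we have no data for rather than guessing.
            continue
        if replica_count != partition_count:
            mismatches.append((deployment, topic, replica_count, partition_count))
    return mismatches


if __name__ == "__main__":
    replicas = {
        "sentry-snuba-subscription-consumer-transactions": 1,
        "sentry-ingest-consumer-events": 5,
    }
    # Hypothetical mapping; which topic each deployment reads is an assumption.
    deployment_topics = {
        "sentry-snuba-subscription-consumer-transactions": "transactions",
        "sentry-ingest-consumer-events": "ingest-events",
    }
    partition_counts = {"transactions": 1, "ingest-events": 20}
    for row in find_replica_mismatches(replicas, deployment_topics, partition_counts):
        print(row)
```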

seborys40 avatar Feb 08 '25 20:02 seborys40

One day, after countless restarts, this worker stopped crashing, even though I hadn't made any changes to the Sentry, Snuba or Kafka configuration. I restarted it and it continued to work without problems.

tuchinsky avatar Feb 20 '25 08:02 tuchinsky

One day, after countless restarts, this worker stopped crashing, even though I hadn't made any changes to the Sentry, Snuba or Kafka configuration. I restarted it and it continued to work without problems.

I ran into the same situation, and after re-creating the snuba-transactions-commit-log topic, the snuba-transactions commit log resumed normal operation.

7y-9 avatar Mar 07 '25 06:03 7y-9