snuba-subscription-consumer-* containers are failing continuously
Self-Hosted Version
23.11.2
CPU Architecture
x86_64
Docker Version
NA
Docker Compose Version
NA
Steps to Reproduce
Seeing the following containers crashing continuously. Are these services used for alerting? We are a bit confused about what these services do.
snuba-subscription-consumer-events
snuba-subscription-consumer-metrics
snuba-subscription-consumer-transactions
Logs:
2024-07-20 15:57:07,088 Initializing Snuba...
2024-07-20 15:57:10,884 Snuba initialization took 3.7952772620010364s
{"module": "builtins", "event": "Checking Clickhouse connections", "severity": "info", "timestamp": "2024-07-20T15:57:10.897290Z"}
2024-07-20 15:57:10,966 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=0): 0, Partition(topic=Topic(name='snuba-commit-log'), index=1): 0, Partition(topic=Topic(name='snuba-commit-log'), index=2): 0, Partition(topic=Topic(name='snuba-commit-log'), index=3): 0, Partition(topic=Topic(name='snuba-commit-log'), index=4): 0}
2024-07-20 15:57:10,979 Caught exception, shutting down...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
self._run_once()
File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
self.__processing_strategy.submit(message)
File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
self.__next_step.submit(message)
File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2024-07-20 15:57:10,981 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7cba41de5f70>...
2024-07-20 15:57:10,983 Partitions to revoke: [Partition(topic=Topic(name='snuba-commit-log'), index=0), Partition(topic=Topic(name='snuba-commit-log'), index=1), Partition(topic=Topic(name='snuba-commit-log'), index=2), Partition(topic=Topic(name='snuba-commit-log'), index=3), Partition(topic=Topic(name='snuba-commit-log'), index=4)]
2024-07-20 15:57:10,983 Partition revocation complete.
2024-07-20 15:57:10,987 Processor terminated
Traceback (most recent call last):
File "/usr/local/bin/snuba", line 33, in <module>
sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 153, in subscriptions_scheduler_executor
processor.run()
File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
self._run_once()
File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
self.__processing_strategy.submit(message)
File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
self.__next_step.submit(message)
File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
The alerting system was working fine. We made a few changes to the Kafka partitions, and after that we saw only these 3 containers go down.
- Initially increased the Kafka partition count for ingest-events and events from 1 to 5 for scale testing
- We saw only these 3 services going down with the above error
- Followed the steps described in this issue: https://github.com/getsentry/self-hosted/issues/2067
- Tried clearing all lag and resetting offsets, but that did not work
- Recreated the topics per the solution mentioned in the above issue, which also did not work
- We deleted all existing alerts and recreated them; alerts are now working, but the containers are still in a failed state
We are a bit confused now about what these services do. Which service is now serving the alerts?
We suspect a partition mismatch (please correct us if this is unrelated), so we have increased all topics to 5 partitions. Reviewing all topic configs, we see three topics (snuba-commit-log, events-subscription-results and ingest-monitors) with a replication factor of 3, while every other topic has a replication factor of 1; all remaining configs are unchanged.
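For anyone wanting to script that review, here is a minimal sketch using confluent-kafka's AdminClient; the bootstrap address is an assumption and needs to be adjusted for your cluster.
```python
# Minimal sketch: print partition count and replication factor per topic.
# "kafka:9092" is an assumed bootstrap address -- adjust for your cluster.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka:9092"})
metadata = admin.list_topics(timeout=10)

for name, topic in sorted(metadata.topics.items()):
    partitions = topic.partitions
    # The replication factor is the number of replicas on any one partition.
    replication = len(next(iter(partitions.values())).replicas) if partitions else 0
    print(f"{name}: partitions={len(partitions)}, replication_factor={replication}")
```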
Also, while listing the consumer groups we see the following with no active members:
Consumer group 'snuba-transactions-subscriptions-consumers' has no active members.
Consumer group 'snuba-events-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-6e1d91f6451a11ef8ad962551908ad8e' has no active members.
Consumer group 'nuba-metrics-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-12e82a30451a11efb933c2a760684d4c' has no active members.
Let us know if any other information is needed.
Expected Result
NA
Actual Result
NA
Event ID
No response
I think this may be a duplicate of https://github.com/getsentry/snuba/issues/5855#issuecomment-2113149084
Any updates on this? I was able to fix snuba-subscription-consumer-transactions by recreating the corresponding topic, but for snuba-subscription-consumer-metrics that did not work.
@mcannizz
Is this simply because Sentry is currently not polling this data, as it isn't needed, and hence Kafka reports no active members? Still, the crash loop should be fixed.
UPDATE: However, once I reset the offsets it does have active members and the problem persists (or at least whenever the pod is not crashing).
Please refer to the issue comment I linked above and ensure you are not running the commit-log topic with more than one partition.
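A minimal sketch of that check, assuming confluent-kafka is available and the broker is reachable at the address below (an assumption):
```python
# Minimal sketch: flag any commit-log topic with more than one partition.
# "kafka:9092" is an assumed bootstrap address.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka:9092"})
metadata = admin.list_topics(timeout=10)

for name, topic in sorted(metadata.topics.items()):
    if "commit-log" in name:
        count = len(topic.partitions)
        print(f"{name}: {count} partition(s)" + ("" if count == 1 else "  <-- must be 1"))
```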
I have had all commit-log topics set to 1 partition and 1 replica all along, and they are all defined in topic_partition_counts and set to 1 there as well. This problem became apparent while I was doing some routine operations like editing topic_partition_counts and updating basic configs.
I can't get past this: events is fine, but metrics and transactions are not.
Sorry, I was logged in with another account. Correction: events is fine, transactions is fine, but metrics is not.
@untitaker, is this at all related to the removal of the beta metrics feature from Sentry? We recently removed the metrics beta feature. I wonder if some components can now be removed from the deployment... although I see data still being sent to the metrics topics.
What exactly are these responsible for?
- sentry-snuba-subscription-consumer-metrics
- sentry-snuba-subscription-consumer-transactions
- sentry-snuba-subscription-consumer-events
@chipzzz metrics is for release health (crashed sessions etc. in the Releases tab), generic-metrics is for the beta metrics feature you mention, transactions is for the Performance product in general, and events is for errors.
Generally, deployments with "subscription" in the name are for alerts. If you don't need alerts on crashed sessions/performance data/errors respectively, you can just remove those deployments.
If you have further questions like this, I suggest filing a separate issue from this one, which should stay focused on the bugs IMO.
@untitaker, I am still using all of the aforementioned except the beta metrics. It's unclear what else could be causing this.
This may be related to https://github.com/getsentry/self-hosted/pull/3106#issue-2332027811
For reference, also including https://github.com/getsentry/snuba/pull/2666
This was removed in https://github.com/getsentry/snuba/pull/3623 but is still referenced here: https://github.com/getsentry/snuba/blob/24.7.1/snuba/subscriptions/scheduler_processing_strategy.py#L210
I resolved the issue.
These consumers:
- sentry-snuba-subscription-consumer-metrics
- sentry-snuba-subscription-consumer-transactions
- sentry-snuba-subscription-consumer-events
also depend on other topics, not just the commit-log topics; these are:
- snuba-metrics
- transactions
- events
In my UAT environment I had an increased number of partitions for these topics but did not have a matching number of consumers to consume from all the partitions, hence the KeyError.
So the number of partitions must match the number of corresponding consumers consuming them. I had tested with other topics/consumers but was not aware that the snuba-metrics, transactions, and events topics were also involved.
However, I am not sure how this problem became apparent, as it was always set up this way.
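To illustrate the alignment I mean, here is a minimal sketch with hypothetical numbers; the consumer-to-topic mapping reflects my understanding above, and the counts are placeholders to fill in from your own broker and deployment:
```python
# Minimal illustration with hypothetical numbers: each subscription consumer
# reads its entity's main topic as well as a commit-log topic, and its replica
# count should match the partition counts of those topics.
subscription_consumers = {
    "sentry-snuba-subscription-consumer-events": ["events", "snuba-commit-log"],
    "sentry-snuba-subscription-consumer-transactions": ["transactions", "snuba-transactions-commit-log"],
    "sentry-snuba-subscription-consumer-metrics": ["snuba-metrics", "snuba-metrics-commit-log"],
}

# Placeholder values -- replace with your actual partition and replica counts.
partition_counts = {topic: 1 for topics in subscription_consumers.values() for topic in topics}
consumer_replicas = {consumer: 1 for consumer in subscription_consumers}

for consumer, topics in subscription_consumers.items():
    for topic in topics:
        if partition_counts[topic] != consumer_replicas[consumer]:
            print(f"mismatch: {topic} has {partition_counts[topic]} partition(s) "
                  f"but {consumer} runs {consumer_replicas[consumer]} replica(s)")
```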
> So the number of partitions must match the number of corresponding consumers consuming them.
@chipzzz can you explain what you mean?
I have 1 partition with 1 replica for the "snuba-transactions-commit-log" and "transactions" topics, but the sentry-snuba-subscription-consumer-transactions worker is still crashing:
File "/usr/local/bin/snuba", line 33, in <module>
sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 153, in subscriptions_scheduler_executor
processor.run()
File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 322, in run
self._run_once()
File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 410, in _run_once
self.__processing_strategy.submit(message)
File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 252, in submit
self.__next_step.submit(message)
File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 8
@tuchinsky, make sure you also define the partition counts in the Snuba config so it is aware of them.
https://stackoverflow.com/questions/73294110/keyerror-1-sentry-snuba-subscription-consumer-events
But essentially, yes, the partition count must match the consumer count. Given you have a KeyError of 8, look for topics that have at least 8 partitions, or somewhere you have at least 8 consumers. So verify the topics, verify the Snuba config, and verify the consumer count, and make sure the counts all align.
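A minimal sketch of that verification, assuming the snuba package is importable (e.g. inside a Snuba container, where the settings module exposes TOPIC_PARTITION_COUNTS) and that the broker is reachable at the assumed address below:
```python
# Minimal sketch: compare Snuba's configured TOPIC_PARTITION_COUNTS against
# the partition counts the broker actually reports.
# "kafka:9092" is an assumed bootstrap address.
from confluent_kafka.admin import AdminClient
from snuba import settings

admin = AdminClient({"bootstrap.servers": "kafka:9092"})
metadata = admin.list_topics(timeout=10)

for topic, configured in sorted(settings.TOPIC_PARTITION_COUNTS.items()):
    actual = len(metadata.topics[topic].partitions) if topic in metadata.topics else None
    if actual != configured:
        print(f"{topic}: configured={configured}, actual={actual}  <-- mismatch")
```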
@seborys40 I didn't have this parameter in the config. I checked the topic configuration in Kafka and then added it to the Snuba settings, but nothing changed; the sentry-snuba-subscription-consumer-transactions worker is still crashing with the same error:
TOPIC_PARTITION_COUNTS = {
"buffered-segments": 20,
"buffered-segments-dlq": 20,
"cdc": 1,
"event-replacements": 1,
"events": 1,
"events-subscription-results": 1,
"generic-events": 1,
"generic-metrics-subscription-results": 1,
"group-attributes": 1,
"ingest-attachments": 20,
"ingest-attachments-dlq": 20,
"ingest-events": 20,
"ingest-events-dlq": 20,
"ingest-feedback-events": 20,
"ingest-feedback-events-dlq": 20,
"ingest-generic-metrics-dlq": 20,
"ingest-metrics": 20,
"ingest-metrics-dlq": 20,
"ingest-monitors": 20,
"ingest-occurrences": 20,
"ingest-performance-metrics": 50,
"ingest-replay-events": 1,
"ingest-replay-recordings": 20,
"ingest-transactions": 20,
"ingest-transactions-dlq": 20,
"metrics-subscription-results": 1,
"monitors-clock-tasks": 20,
"monitors-clock-tick": 20,
"outcomes": 1,
"outcomes-billing": 20,
"processed-profiles": 1,
"profiles": 20,
"profiles-call-tree": 1,
"scheduled-subscriptions-events": 1,
"scheduled-subscriptions-generic-metrics-counters": 1,
"scheduled-subscriptions-generic-metrics-distributions": 1,
"scheduled-subscriptions-generic-metrics-gauges": 1,
"scheduled-subscriptions-generic-metrics-sets": 1,
"scheduled-subscriptions-metrics": 1,
"scheduled-subscriptions-transactions": 1,
"shared-resources-usage": 1,
"snuba-commit-log": 1,
"snuba-dead-letter-generic-events": 1,
"snuba-dead-letter-generic-metrics": 1,
"snuba-dead-letter-group-attributes": 1,
"snuba-dead-letter-metrics": 1,
"snuba-dead-letter-querylog": 1,
"snuba-dead-letter-replays": 1,
"snuba-generic-events-commit-log": 1,
"snuba-generic-metrics": 1,
"snuba-generic-metrics-counters-commit-log": 1,
"snuba-generic-metrics-distributions-commit-log": 1,
"snuba-generic-metrics-gauges-commit-log": 1,
"snuba-generic-metrics-sets-commit-log": 1,
"snuba-metrics": 1,
"snuba-metrics-commit-log": 1,
"snuba-metrics-summaries": 1,
"snuba-profile-chunks": 1,
"snuba-queries": 1,
"snuba-spans": 1,
"snuba-transactions-commit-log": 1,
"transactions": 1,
"transactions-subscription-results": 1,
"uptime-configs": 20,
"uptime-results": 20
}
My container replica counts:
sentry-billing-metrics-consumer 1
sentry-cron 1
sentry-generic-metrics-consumer 5
sentry-ingest-consumer-attachments 5
sentry-ingest-consumer-events 5
sentry-ingest-consumer-transactions 5
sentry-ingest-monitors 3
sentry-ingest-occurrences 3
sentry-ingest-profiles 5
sentry-ingest-replay-recordings 3
sentry-metrics 1
sentry-metrics-consumer 3
sentry-post-process-forward-errors 1
sentry-post-process-forward-issue-platform 1
sentry-post-process-forward-transactions 1
sentry-relay 2
sentry-snuba-api 3
sentry-snuba-consumer 1
sentry-snuba-generic-metrics-counters-consumer 1
sentry-snuba-generic-metrics-distributions-consumer 1
sentry-snuba-generic-metrics-sets-consumer 1
sentry-snuba-group-attributes-consumer 1
sentry-snuba-issue-occurrence-consumer 1
sentry-snuba-metrics-consumer 1
sentry-snuba-outcomes-billing-consumer 5
sentry-snuba-outcomes-consumer 1
sentry-snuba-profiling-functions-consumer 1
sentry-snuba-profiling-profiles-consumer 1
sentry-snuba-replacer 1
sentry-snuba-replays-consumer 1
sentry-snuba-spans-consumer 1
sentry-snuba-subscription-consumer-events 1
sentry-snuba-subscription-consumer-metrics 1
sentry-snuba-subscription-consumer-transactions 1
sentry-snuba-transactions-consumer 1
sentry-subscription-consumer-events 1
sentry-subscription-consumer-generic-metrics 1
sentry-subscription-consumer-metrics 1
sentry-subscription-consumer-transactions 1
sentry-symbolicator-api 1
sentry-vroom 2
sentry-web 3
sentry-worker 5
sentry-worker-events 3
sentry-worker-transactions 5
It will be one of your topics with more than 1 partition. One of those topics either doesn't correctly align with the partition count in the Snuba config, or has more/fewer partitions than consumers.
Also, the monitors clock-tasks and clock-tick topics should be kept at 1 partition, and the same goes for their associated consumers, at least from what I saw of the bug when it was implemented; I'm not sure what version you're on or whether it was fixed.
Your replica counts should be aligned with the counts in TOPIC_PARTITION_COUNTS, FYI, and also with the actual topic partition counts.
One day, after seemingly endless restarts, this worker stopped crashing, even though I didn't make any changes to the Sentry, Snuba, or Kafka configuration. I restarted it and it continued to work without problems.
> One day, after seemingly endless restarts, this worker stopped crashing, even though I didn't make any changes to the Sentry, Snuba, or Kafka configuration. I restarted it and it continued to work without problems.
I ran into the same situation, and after re-creating the snuba-transactions-commit-log topic, the transactions subscription consumer resumed normal operation.
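For anyone wanting to script that recreation, a minimal sketch using confluent-kafka's AdminClient; the bootstrap address and replication factor are assumptions, and deleting the topic discards any data still in it:
```python
# Minimal sketch: recreate snuba-transactions-commit-log with a single
# partition. "kafka:9092" and replication_factor=1 are assumptions.
import time

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})
topic = "snuba-transactions-commit-log"

# Delete the existing topic and wait for the deletion future to complete.
for future in admin.delete_topics([topic], operation_timeout=30).values():
    future.result()

# Crude wait for the deletion to propagate before recreating the topic.
time.sleep(10)

# Recreate the topic with exactly one partition.
for future in admin.create_topics([NewTopic(topic, num_partitions=1, replication_factor=1)]).values():
    future.result()
```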