self-hosted icon indicating copy to clipboard operation
self-hosted copied to clipboard

snuba-subscription-consumer-events restarts with different KeyError:

Open arrmor opened this issue 9 months ago • 5 comments

Self-Hosted Version

23.1.1

CPU Architecture

x86_64

Docker Version

18.03.1-ce

Docker Compose Version

1.29.2

Machine Specification

  • [x] My system meets the minimum system requirements of Sentry

Steps to Reproduce

  1. run docker-compose sentry v23.1.1 on more than 1 VM(server)
  2. change kafka partitions for topic snuba-commit-log
  3. change TOPIC_PARTITION_COUNTS to represent actual number of partitions for the given kafka topic
  4. run snuba-subscription-consumer-events

Expected Result

snuba-subscription-consumer-events service runs without errors and restarts

Actual Result

I've read somewhere that the type of error I've encountered is somehow connected with recent changes in kafka partition number for the topic.

So I've got snuba-subscription-consumer-events service constantly restarting with such errors:

Click to expand logs

2025-02-07 15:18:06,869 Initializing Snuba...
2025-02-07 15:18:10,564 Snuba initialization took 3.6961687952280045s
2025-02-07 15:18:11,149 Initializing Snuba...
2025-02-07 15:18:14,786 Snuba initialization took 3.637780174612999s
2025-02-07 15:18:17,515 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=0): 40134965}
2025-02-07 15:18:17,530 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 10
2025-02-07 15:18:17,532 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7efcd0610250>...
2025-02-07 15:18:17,533 Partitions revoked: [Partition(topic=Topic(name='snuba-commit-log'), index=0)]
2025-02-07 15:18:17,535 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 140, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 10
2025-02-07 15:18:26,443 Initializing Snuba...
2025-02-07 15:18:30,162 Snuba initialization took 3.7202555276453495s
2025-02-07 15:18:30,754 Initializing Snuba...
2025-02-07 15:18:34,535 Snuba initialization took 3.7820699214935303s
2025-02-07 15:18:35,411 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=1): 104750374}
2025-02-07 15:18:35,418 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 3
2025-02-07 15:18:35,420 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7f0ed5eee250>...
2025-02-07 15:18:35,420 Partitions revoked: [Partition(topic=Topic(name='snuba-commit-log'), index=1)]
2025-02-07 15:18:35,425 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 140, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 3
2025-02-07 15:18:44,927 Initializing Snuba...
2025-02-07 15:18:48,769 Snuba initialization took 3.842433415353298s
2025-02-07 15:18:49,370 Initializing Snuba...
2025-02-07 15:18:53,027 Snuba initialization took 3.6579839438199997s
2025-02-07 15:18:54,882 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=4): 55029728}
2025-02-07 15:18:54,896 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2025-02-07 15:18:54,898 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7f3e80843250>...
2025-02-07 15:18:54,898 Partitions revoked: [Partition(topic=Topic(name='snuba-commit-log'), index=4)]
2025-02-07 15:18:54,902 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 140, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2025-02-07 15:19:02,043 Initializing Snuba...
2025-02-07 15:19:05,724 Snuba initialization took 3.6818041875958443s
2025-02-07 15:19:06,300 Initializing Snuba...
2025-02-07 15:19:09,999 Snuba initialization took 3.6995532028377056s
2025-02-07 15:19:13,070 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=4): 55029728}
2025-02-07 15:19:13,086 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2025-02-07 15:19:13,087 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7fc729bf0250>...
2025-02-07 15:19:13,088 Partitions revoked: [Partition(topic=Topic(name='snuba-commit-log'), index=4)]
2025-02-07 15:19:13,089 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 140, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 175, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 225, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 246, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2

KeyError number can be different. I've seen 2,4,14,16 :)

My snuba-config/setting.py contains TOPIC_PARTITION_COUNTS as follows:

***
TOPIC_PARTITION_COUNTS = Mapping[str, int] = {
    "event-replacements": 5,
    "events": 20,
    "events-subscription-results": 5,
    "ingest-sessions": 5,
    "outcomes": 5,
    "snuba-commit-log": 5,
    "snuba-transactions-commit-log": 5,
    "transactions-subscription-results": 5
}

in docker-compose I also have similar configs. Added to see mb service can read only one or another option.

***
  snuba-subscription-consumer-events:
  ***
  TOPIC_PARTITION_COUNTS: '{"event-replacements": 5,"events": 20,"events-subscription-results": 5,"ingest-sessions": 5,"outcomes": 5,"snuba-commit-log": 5,"snuba-transactions-commit-log": 5,"transactions-subscription-results": 5}'
command: subscriptions-scheduler-executor --dataset events --entity events --auto-offset-reset=latest --no-strict-offset-reset --consumer-group=snuba-events-subscriptions-consumers --followed-consumer-group=snuba-consumers --delay-seconds=60 --schedule-ttl=60 --stale-threshold-seconds=900
***

And my kafka group\topic is constantly in state rebalancing or has no active members lag is growing. Had 5 partitions as you can see below:


Warning: Consumer group 'snuba-events-subscriptions-consumers' is rebalancing.

GROUP                                TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
snuba-events-subscriptions-consumers snuba-commit-log 4          55029728        55030657        929             -               -               -
snuba-events-subscriptions-consumers snuba-commit-log 1          104750374       104751516       1142            -               -               -
snuba-events-subscriptions-consumers snuba-commit-log 3          87189654        87193048        3394            -               -               -
snuba-events-subscriptions-consumers snuba-commit-log 2          73663948        73665379        1431            -               -               -
snuba-events-subscriptions-consumers snuba-commit-log 0          40134965        40135652        687             -               -               -

snuba-transactions-commit-log lag is also growing, but i guess it is a different issue. Consumer containers for that group assigned to a correct partition indexes and don't restarting.

Event ID

No response

arrmor avatar Feb 07 '25 15:02 arrmor

any suggestions would be awesome :)

arrmor avatar Feb 20 '25 08:02 arrmor

same issue

Self-Hosted Version 25.2.0

CPU Architecture x86_64

Docker Version Docker version 27.5.1, build 9f9e405

Docker Compose Version Docker Compose version v2.32.4-desktop.1

Machine Specification MacBook Pro(16 英寸,2019 年) 处理器:2.4 GHz 八核Intel Core i9 内存:32 GB 2667 MHz DDR4

My system meets the minimum system requirements of Sentry

logs:⬇️

Docker images built.

▶ Bootstrapping and migrating Snuba ... Network sentry-self-hosted_default Creating Network sentry-self-hosted_default Created Container sentry-self-hosted-redis-1 Creating Container sentry-self-hosted-kafka-1 Creating Container sentry-self-hosted-clickhouse-1 Creating Container sentry-self-hosted-clickhouse-1 Created Container sentry-self-hosted-redis-1 Created Container sentry-self-hosted-kafka-1 Created Container sentry-self-hosted-clickhouse-1 Starting Container sentry-self-hosted-kafka-1 Starting Container sentry-self-hosted-redis-1 Starting Container sentry-self-hosted-kafka-1 Started Container sentry-self-hosted-clickhouse-1 Started Container sentry-self-hosted-redis-1 Started 2025-02-26 13:33:19,950 Initializing Snuba... 2025-02-26 13:33:24,048 Snuba initialization took 4.098573600000236s 2025-02-26 13:33:24,656 Initializing Snuba... 2025-02-26 13:33:28,653 Snuba initialization took 3.9985657639990677s 2025-02-26 13:33:28,662 Attempting to connect to Kafka (attempt 0)... 2025-02-26 13:33:28,691 Connected to Kafka on attempt 0 2025-02-26 13:33:28,769 Creating Kafka topics... Traceback (most recent call last): File "/usr/local/bin/snuba", line 33, in sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in call return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/snuba/snuba/cli/bootstrap.py", line 89, in bootstrap Runner().run_all(force=True) File "/usr/src/snuba/snuba/migrations/runner.py", line 220, in run_all pending_migrations = self._get_pending_migrations() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/snuba/snuba/migrations/runner.py", line 462, in _get_pending_migrations group_migrations = self._get_pending_migrations_for_group(group) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/snuba/snuba/migrations/runner.py", line 485, in _get_pending_migrations_for_group raise MigrationInProgress(str(migration_key)) snuba.migrations.errors.MigrationInProgress: generic_metrics: 0041_adjust_partitioning_meta_tables Error in install/bootstrap-snuba.sh:7. '$dcr snuba-api bootstrap --force' exited with status 1 -> ./install.sh:main:34 --> install/bootstrap-snuba.sh:source:7

Cleaning up...

qidasheng369 avatar Feb 26 '25 13:02 qidasheng369

Heya @untitaker can you take a look at this? Thanks!

aldy505 avatar Jun 05 '25 03:06 aldy505

@phacops @onewland i think this is for your team

untitaker avatar Jun 05 '25 07:06 untitaker

Well. I dealt with this issue by recreating all topics with partition number of 1. That way you don't have to maintain TOPIC_PARTITION_COUNTS anymore.

And iirc cleanup of old records from tables where partition != 0 or just sentry-cleanup of old data.

doc-sheet avatar Jun 05 '25 18:06 doc-sheet