OffsetOutOfRange errors have returned

Open bobvandevijver opened this issue 10 months ago • 12 comments

Self-Hosted Version

24.3.0

CPU Architecture

x86_64

Docker Version

26.0.0 / 24.0.7

Docker Compose Version

2.25.0 / 2.21.1

Steps to Reproduce

The OffsetOutOfRange errors (discussed before in https://github.com/getsentry/self-hosted/issues/1894) have spontaneously returned on 2 out of my 3 self-hosted installations. This is mostly visible due to the alert no longer being executed.

For one of them, I removed the kafka and zookeeper volumes last week to solve the issue, but it seems that it was only temporary as the errors have returned. The other one only caught my attention just now.

As this might be related to https://github.com/getsentry/self-hosted/issues/2931 and https://github.com/getsentry/self-hosted/issues/2876, I will now remove the kafka and zookeeper volumes again, and replace the rust-consumers with consumer.

I'm also seeing https://github.com/getsentry/snuba/issues/5707 on the other instance, so I will switch to the non-rust consumers there as well.

Expected Result

Well, no errors, and events being processed correctly 😄

Actual Result

sentry-self-hosted-post-process-forwarder-errors-1                 | 11:17:54 [INFO] arroyo.processing.processor: Processor terminated
sentry-self-hosted-post-process-forwarder-transactions-1           | 11:17:54 [INFO] arroyo.processing.processor: New partitions assigned: {Partition(topic=Topic(name='transactions'), index=0): 0}
sentry-self-hosted-post-process-forwarder-transactions-1           | 11:17:54 [INFO] sentry.post_process_forwarder.post_process_forwarder: Starting multithreaded post process forwarder
sentry-self-hosted-post-process-forwarder-errors-1                 | Traceback (most recent call last):
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/bin/sentry", line 8, in <module>
sentry-self-hosted-post-process-forwarder-errors-1                 |     sys.exit(main())
sentry-self-hosted-post-process-forwarder-errors-1                 |              ^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/__init__.py", line 190, in main
sentry-self-hosted-post-process-forwarder-errors-1                 |     func(**kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
sentry-self-hosted-post-process-forwarder-errors-1                 |     return self.main(*args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
sentry-self-hosted-post-process-forwarder-errors-1                 |     rv = self.invoke(ctx)
sentry-self-hosted-post-process-forwarder-errors-1                 |          ^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return _process_result(sub_ctx.command.invoke(sub_ctx))
sentry-self-hosted-post-process-forwarder-errors-1                 |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return _process_result(sub_ctx.command.invoke(sub_ctx))
sentry-self-hosted-post-process-forwarder-errors-1                 |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return ctx.invoke(self.callback, **ctx.params)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return __callback(*args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
sentry-self-hosted-post-process-forwarder-errors-1                 |     return f(get_current_context(), *args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/decorators.py", line 69, in inner
sentry-self-hosted-post-process-forwarder-errors-1                 |     return ctx.invoke(f, *args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return __callback(*args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
sentry-self-hosted-post-process-forwarder-errors-1                 |     return f(get_current_context(), *args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/decorators.py", line 29, in inner
sentry-self-hosted-post-process-forwarder-errors-1                 |     return ctx.invoke(f, *args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
sentry-self-hosted-post-process-forwarder-errors-1                 |     return __callback(*args, **kwargs)
sentry-self-hosted-post-process-forwarder-errors-1                 |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/runner/commands/run.py", line 448, in basic_consumer
sentry-self-hosted-post-process-forwarder-errors-1                 |     run_processor_with_signals(processor, consumer_name)
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/utils/kafka.py", line 46, in run_processor_with_signals
sentry-self-hosted-post-process-forwarder-errors-1                 |     processor.run()
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 322, in run
sentry-self-hosted-post-process-forwarder-errors-1                 |     self._run_once()
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/arroyo/processing/processor.py", line 384, in _run_once
sentry-self-hosted-post-process-forwarder-errors-1                 |     self.__message = self.__consumer.poll(timeout=1.0)
sentry-self-hosted-post-process-forwarder-errors-1                 |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/sentry/consumers/synchronized.py", line 235, in poll
sentry-self-hosted-post-process-forwarder-errors-1                 |     message = self.__consumer.poll(timeout)
sentry-self-hosted-post-process-forwarder-errors-1                 |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sentry-self-hosted-post-process-forwarder-errors-1                 |   File "/usr/local/lib/python3.11/site-packages/arroyo/backends/kafka/consumer.py", line 414, in poll
sentry-self-hosted-post-process-forwarder-errors-1                 |     raise OffsetOutOfRange(str(error))
sentry-self-hosted-post-process-forwarder-errors-1                 | arroyo.errors.OffsetOutOfRange: KafkaError{code=_AUTO_OFFSET_RESET,val=-140,str="fetch failed due to requested offset not available on the broker: Broker: Offset out of range (broker 1001)"}
sentry-self-hosted-post-process-forwarder-errors-1 exited with code 0
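
To check which containers are affected across a larger log, the exception line can be filtered out of the compose output. This is only a sketch: it assumes the `<container> | <message>` line layout shown in the log above, and the function name is made up for illustration.

```shell
# Reads `docker compose logs` output on stdin and prints, once each, the name
# of every container that logged an arroyo OffsetOutOfRange exception.
# Assumes each line looks like "<container>   | <message>".
offset_errors_by_container() {
  grep -F 'arroyo.errors.OffsetOutOfRange' |
    awk -F'|' '{ gsub(/[ \t]+$/, "", $1); print $1 }' |
    sort -u
}
```

It could be used as `docker compose logs --no-color | offset_errors_by_container`.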

Event ID

No response

bobvandevijver avatar Apr 10 '24 11:04 bobvandevijver

Same here. So far the same consumer group & topic it seems. Consumer group: post-process-forwarder Topic: events

hostalp avatar Apr 10 '24 13:04 hostalp

I'm sorry I can't hold this back

[image]

Jokes aside, I can't reproduce this on my end since I don't use Kafka anymore (I replaced it with Redpanda and I got no errors like this). Does this command still work?

# Shut everything down first; only Kafka needs to be running for the next steps
sudo docker compose down &&
sudo docker compose up -d --wait kafka &&
# Delete the post-process-forwarder consumer group
sudo docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group post-process-forwarder --delete &&
# Delete the events topic offsets from the post-process-forwarder consumer group on Kafka
sudo docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group post-process-forwarder --topic events --delete-offsets &&
# Start everything again
sudo docker compose up -d

Let me know if that works.

aldy505 avatar Apr 11 '24 01:04 aldy505

@aldy505 Well, that approach could work too, however what I do in these cases is just a simple offset reset:

docker compose down -v
docker compose --env-file .env.custom up -d kafka
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --reset-offsets --to-latest --execute --group post-process-forwarder --topic events
docker compose --env-file .env.custom up -d

or you can "optimistically" reset all of them:

docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --reset-offsets --to-latest --execute --all-groups --all-topics
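
Before running a reset with --execute, it may help to look at the group's current offsets and preview the change. The sketch below assumes the same `kafka` service name and `kafka:9092` bootstrap address as the commands above. The function names are hypothetical. `--describe` and `--dry-run` are standard kafka-consumer-groups options.

```shell
# Hypothetical wrapper; "kafka" and kafka:9092 are the service name and
# bootstrap address used elsewhere in this thread.
kcg() {
  docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 "$@"
}

# Show current offset, log-end offset and lag per partition for a group.
describe_group() {
  kcg --describe --group "$1"
}

# Print what a reset to the latest offset would do, without applying it.
preview_reset() {
  kcg --reset-offsets --to-latest --dry-run --group "$1" --topic "$2"
}
```

For example, `describe_group post-process-forwarder` followed by `preview_reset post-process-forwarder events`.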

hostalp avatar Apr 11 '24 12:04 hostalp

Unsure why this is only happening for post-process-forwarder, since I don't think that was converted to rust-consumer. Wondering if we are missing a --no-strict-offset-reset on the post-process-forwarder containers. Could you try adding that in the docker compose file? @hostalp @bobvandevijver
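
To see which post-process-forwarder services in a compose file do not carry the flag yet, something like the following could be used. It is a rough sketch that assumes each service has a one-line `command:` entry mentioning post-process-forwarder; the function name is hypothetical.

```shell
# Print the post-process-forwarder command lines in a compose file that do
# not yet contain --no-strict-offset-reset (assumes one-line `command:` entries).
forwarders_missing_flag() {
  grep -E 'command:.*post-process-forwarder' "$1" |
    grep -v -e '--no-strict-offset-reset'
}
```

Running `forwarders_missing_flag docker-compose.yml` prints nothing when every matching line already has the flag.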

hubertdeng123 avatar Apr 12 '24 16:04 hubertdeng123

+1

erfantkerfan avatar Apr 14 '24 07:04 erfantkerfan

It looks like reverting to the non-rust consumers has fixed it for now: I haven't seen the offset issue return since I created the ticket and removed the kafka volumes.

bobvandevijver avatar Apr 14 '24 08:04 bobvandevijver

On the other hand, I've set the suggested --no-strict-offset-reset flag on all 3 post-process-forwarder consumers; however, it may take days or even weeks to find out whether it really helped.

hostalp avatar Apr 15 '24 05:04 hostalp

I also reverted to the non-rust consumers, but today the 3 post-process-forwarder consumers failed again with OffsetOutOfRange errors.

I will try to go back to the rust-consumers now, and add --no-strict-offset-reset to the post-process-forwarders.

> Does this command still work [...] Let me know if that works.

It works :-)

magnuslarsen avatar Apr 16 '24 08:04 magnuslarsen

Since adding --no-strict-offset-reset I've had no crashes (including the post-process-forwarders, which crashed every 5 days or so before I added the option).

For me, this seems to have successfully fixed the issue, with seemingly no side effects :-)

magnuslarsen avatar Apr 29 '24 06:04 magnuslarsen

I concur.

hostalp avatar Apr 29 '24 13:04 hostalp

@hubertdeng123 @azaslavsky Do you think it's safe to put --no-strict-offset-reset on some of the containers that don't have it by default (as in, hardcoded in the docker-compose.yml)? Can you validate that with the code owners on Slack? Thanks!

aldy505 avatar Apr 29 '24 14:04 aldy505

~~Note that I did not add the --no-strict-offset-reset option, I only switched to the non-rust consumers. And the error hasn't returned since for us.~~

Update June 4th: The error did return, so now I did add --no-strict-offset-reset and reverted back to the rust consumers.

bobvandevijver avatar Apr 29 '24 14:04 bobvandevijver

I encountered the same issue. Initially, I upgraded to version 24.4.2, but ran into this problem: https://github.com/getsentry/self-hosted/issues/2876. Consequently, after restoring the entire Sentry system, I decided to upgrade to version 24.2.0 since it does not contain any rust-consumer.

Unfortunately, I encountered this issue, which was quite disappointing!

However, echoing what @hubertdeng123 suggested, adding --no-strict-offset-reset to the post-process-forwarder containers resolved the issue for me.

Additionally, I share @aldy505's concern: I don't know whether it is safe to add --no-strict-offset-reset.

liukch avatar Jun 04 '24 03:06 liukch

It should be safe to do, looks like we do that in prod. I'm going to put up a PR to add this option to the post process forwarders.

hubertdeng123 avatar Jun 04 '24 17:06 hubertdeng123

> It should be safe to do, looks like we do that in prod. I'm going to put up a PR to add this option to the post process forwarders.

@hubertdeng123 I found this PR was released in 24.5.1, but because of these issues https://github.com/getsentry/self-hosted/issues/2876 https://github.com/getsentry/snuba/issues/5707, we can not do any upgrade.

liukch avatar Jun 06 '24 06:06 liukch

> > It should be safe to do, looks like we do that in prod. I'm going to put up a PR to add this option to the post process forwarders.
>
> @hubertdeng123 I found this PR was released in 24.5.1, but because of these issues #2876 getsentry/snuba#5707, we can not do any upgrade.

@liukch Those two are separate issues. The massive ClickHouse logs don't really cause any ingestion or event issues on the running Sentry instance, so I'm wondering what's happening on your side that means you "can not do any upgrade". Would you please expand on that in a separate (or perhaps more relevant) issue?

aldy505 avatar Jun 06 '24 14:06 aldy505

@aldy505 ClickHouse generates a large number of logs, which is a serious problem in itself. Additionally, while generating a large number of logs, it can also cause transactions not to be accepted, as mentioned in https://github.com/getsentry/self-hosted/issues/2876

liukch avatar Jun 07 '24 02:06 liukch