We had a spike in errors and after that 100% of errors are getting dropped, could someone help me figure out why?

Open edgariscoding opened this issue 10 months ago • 10 comments

Self-Hosted Version

24.3.0 unknown

CPU Architecture

x86_64

Docker Version

24.0.7

Docker Compose Version

2.21.0

Steps to Reproduce

On April 8th (Monday) we experienced a spike in dropped errors. Nothing peculiar was going on that day, and we didn't receive any complaints of downtime for our web application.

According to the stats page, this started at 9am on April 8th, and since then 100% of errors have been dropped.

I have rate limiting set up, but that doesn't seem to be the cause, as can be seen in the screenshots below.

I don't see any warnings in the System Warnings page in the admin panel.

Anybody have any suggestions?

I'd love it if Sentry showed the reason why the errors were dropped.

Expected Result

Expected errors to not be dropped.

Actual Result

Docker compose logs: https://pastebin.com/raw/TXHJL7i3

[four screenshots]

Event ID

No response

edgariscoding avatar Apr 15 '24 16:04 edgariscoding

That is indeed interesting. I'm seeing "Net Exception: Socket is not connected" errors with stack traces in your clickhouse logs. Maybe your Sentry instance lost connection there?

hubertdeng123 avatar Apr 16 '24 22:04 hubertdeng123

@hubertdeng123 I'm not sure. It seems like there was a RAM bottleneck along with a storage bottleneck. The Docker directory ballooned in size to over 60GB. I increased the storage and RAM and reinstalled.

Now Sentry is logging errors and I can see them come in, but the stats page shows that there were 32 errors and 32 of them were dropped.

But if I look at the list of issues for this project for the last 7 days, I have about 350 pages of issues.

Errors are coming in, but Sentry isn't counting them and it's considering them as dropped.
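
For anyone wanting to rule out the storage bottleneck mentioned above, here is a minimal Python sketch that reports how full the filesystem holding a given directory is. The function name `disk_usage_report` and the 90% warning threshold are illustrative assumptions, and `/var/lib/docker` is only the usual default Docker data directory on Linux, so your path may differ.

```python
import shutil


def disk_usage_report(path, warn_fraction=0.9):
    """Summarize disk usage for the filesystem that contains `path`.

    `warn_fraction` is an arbitrary illustrative threshold, not a Sentry setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "total_gb": usage.total / 1e9,
        "used_gb": usage.used / 1e9,
        "free_gb": usage.free / 1e9,
        "used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,
    }


if __name__ == "__main__":
    # Assumed default Docker data root; adjust if yours is elsewhere.
    print(disk_usage_report("/var/lib/docker"))
```

This only looks at the filesystem as a whole. It does not say which container or volume is responsible for the growth.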

edgariscoding avatar Apr 22 '24 18:04 edgariscoding

It's quite difficult to debug this remotely - Sentry knows that some errors didn't make it all the way through the pipeline, but that's really all it knows, otherwise they wouldn't be dropped errors. Usually these sorts of things are related to connection issues between various containers (hence the dropping), memory limitations, or configuration at the orchestrator or cloud provider level.

azaslavsky avatar Apr 23 '24 21:04 azaslavsky

@azaslavsky Do you know if there’s a guide on how to rebuild/reinstall from scratch but retaining data like the projects themselves, user accounts, settings, etc? I don’t care if I lose all of the issues.

Running ./install.sh doesn’t seem to be enough for me, I keep having issues.

edgariscoding avatar Apr 24 '24 17:04 edgariscoding

Yep, there is a backup/restore tool for exactly this use case: https://develop.sentry.dev/self-hosted/backup/#partial-json-backup

azaslavsky avatar Apr 25 '24 21:04 azaslavsky

@hubertdeng123 having this exact issue and getting absolutely spammed by the logs you mention above:

clickhouse-1                                    | 2024.05.04 21:52:34.085404 [ 281 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    |
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x101540cd in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6fd5 in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
clickhouse-1                                    | 9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
clickhouse-1                                    | 10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

It is not clear to me at all why this started happening. Our instance has run for months without incident and there have been no changes I am aware of. What could cause it to lose connection to clickhouse?
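
To get a feel for when the log spam started and how dense it is, something like the following Python sketch could count these errors per minute. It assumes the log lines look like the ones pasted above (a `clickhouse-1 |` prefix followed by a `YYYY.MM.DD HH:MM:SS.ffffff` timestamp); the function name `count_socket_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches the header line of each error: the timestamp that follows the
# "clickhouse-1 | " prefix, then the "Socket is not connected" message.
# Stack-frame lines carry no timestamp and are therefore skipped.
LINE_RE = re.compile(
    r"\|\s*(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}):\d{2}\.\d+ "
    r".*<Error>.*Socket is not connected"
)


def count_socket_errors(lines):
    """Return a Counter mapping 'YYYY.MM.DD HH:MM' to the number of errors."""
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            per_minute[match.group(1)] += 1
    return per_minute
```

One way to use it would be to pipe the output of `docker compose logs clickhouse` into a small script that passes `sys.stdin` to `count_socket_errors` and prints the result sorted by minute.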

csvan avatar May 04 '24 22:05 csvan

@csvan Have you updated your install recently?

azaslavsky avatar May 07 '24 20:05 azaslavsky

I'm not sure what happened, but after updating to version 24.4.2 everything SEEMS to be working fine. I no longer have 100% of errors dropped. I didn't change anything on our server.

edgariscoding avatar May 08 '24 23:05 edgariscoding

~~Same errors here (24.4.0 and nightly), though it seems they do not affect ingestion or Sentry's general working status.~~

Sorry, I just discovered https://github.com/getsentry/self-hosted/issues/2978. Migrating back to the plain consumer instead of rust-consumer fixed the log spam.
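
As a rough sketch of what "migrating back" could look like, the Python snippet below rewrites compose-file text so that service commands starting with `rust-consumer` start with `consumer` instead. It assumes the Snuba consumer services in your `docker-compose.yml` use lines of the form `command: rust-consumer ...`; the helper name `use_plain_consumer` is hypothetical, and you should check the linked issue and your own compose file before changing anything.

```python
import re


def use_plain_consumer(compose_text):
    """Swap a leading `rust-consumer` subcommand for `consumer`.

    Only lines of the assumed form `command: rust-consumer ...` are changed;
    everything after the subcommand is left untouched.
    """
    return re.sub(
        r"^([ \t]*command:[ \t]*)rust-consumer\b",
        r"\1consumer",
        compose_text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    # Illustrative fragment only, not copied from a real compose file.
    example = "    command: rust-consumer --storage errors\n"
    print(use_plain_consumer(example), end="")
```

The function works on text rather than on the file itself, so the result can be reviewed or diffed before it is written back.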

yakky avatar May 09 '24 00:05 yakky

I also ran into this situation, where 100% of issues were dropped, when I upgraded 23.11.2 -> 24.3.0. After I upgraded 24.3.0 -> 24.5.0, everything seems to be normal according to the stats page.

Tha-Fox avatar May 27 '24 07:05 Tha-Fox