self-hosted
We had a spike in errors, and since then 100% of errors are getting dropped. Could someone help me figure out why?
Self-Hosted Version
24.3.0 unknown
CPU Architecture
x86_64
Docker Version
24.0.7
Docker Compose Version
2.21.0
Steps to Reproduce
On April 8th (Monday) we experienced a spike in dropped errors. There was nothing peculiar going on that day; we didn't receive any complaints of downtime for our web application.
According to the stats page this started at 9 AM, and from April 8th at 9 AM until today 100% of errors have been dropped.
I have rate limiting set up, but that doesn't seem to be the cause, as can be seen in the screenshots below.
I don't see any warnings in the System Warnings page in the admin panel.
Anybody have any suggestions?
I'd love it if Sentry showed a reason why the errors were dropped.
Expected Result
Expected errors to not be dropped.
Actual Result
Docker compose logs: https://pastebin.com/raw/TXHJL7i3
Event ID
No response
That is indeed interesting. I'm seeing `Net Exception: Socket is not connected, Stack trace` in your clickhouse logs? Maybe your Sentry instance lost connection there?
@hubertdeng123 I'm not sure. It seems like there was a RAM bottleneck along with a storage bottleneck. The Docker directory ballooned in size to over 60 GB. I increased the storage and RAM and reinstalled.
Now Sentry is logging errors, and I can see them come in... but the stats page shows that there were 32 errors and 32 of them were dropped.
But if I look at the list of issues for this project for the last 7 days, I have about 350 pages of issues.
Errors are coming in, but Sentry isn't counting them and is considering them dropped.
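For anyone else hitting the storage side of this, it may be worth checking where the space is actually going before resizing. A minimal sketch, assuming a default Docker data root on Linux (adjust the paths if yours differs):

```bash
# High-level breakdown of images, containers, build cache, and volumes
docker system df -v

# Largest named volumes under the default Docker data root
sudo du -sh /var/lib/docker/volumes/* | sort -h | tail -n 10

# In self-hosted Sentry, the Kafka and ClickHouse volumes are commonly the ones that grow
docker volume ls | grep -Ei 'kafka|clickhouse|postgres'
```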
It's quite difficult to debug this remotely - Sentry knows that some errors didn't make it all the way through the pipeline, but that's really all it knows, otherwise they wouldn't be dropped errors. Usually these sorts of things are related to connection issues between various containers (hence the dropping), memory limitations, or configuration at the orchestrator or cloud provider level.
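If anyone wants to chase the connection angle on their own install, a rough starting point (service names vary by self-hosted version, so check `docker compose ps` for the exact ones on yours):

```bash
# Is anything restarting, exited, or unhealthy?
docker compose ps

# Tail the services most involved in the errors pipeline
docker compose logs --tail=200 clickhouse kafka

# Scan all services' recent logs for consumer/snuba errors
docker compose logs --tail=200 | grep -iE 'consumer|snuba' | tail -n 200

# Look for memory pressure / OOM kills on the host
free -h
sudo dmesg | grep -i 'out of memory'
```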
@azaslavsky Do you know if there's a guide on how to rebuild/reinstall from scratch while retaining data like the projects themselves, user accounts, settings, etc.? I don't care if I lose all of the issues.
Running ./install.sh doesn't seem to be enough for me; I keep having issues.
Yep, there is a backup/restore tool for exactly this use case: https://develop.sentry.dev/self-hosted/backup/#partial-json-backup
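In case it saves a click, the flow from that page is essentially an export before the rebuild and an import after. A sketch of the documented commands, run from the self-hosted directory (double-check the linked page for the exact invocation on your version):

```bash
# Export users, organizations, projects, and settings (event/issue data is not included)
docker compose run --rm -T -e SENTRY_LOG_LEVEL=CRITICAL web export > sentry/backup.json

# ...reinstall / rebuild the instance (e.g. ./install.sh on a clean checkout)...

# Import into the fresh install; the sentry/ config dir is mounted at /etc/sentry in the web container
docker compose run --rm -T web import /etc/sentry/backup.json
```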
That is indeed interesting. I'm seeing `Net Exception: Socket is not connected, Stack trace` in your clickhouse logs? Maybe your Sentry instance lost connection there?
@hubertdeng123 having this exact issue and getting absolutely spammed by the logs you mention above:
```
clickhouse-1 | 2024.05.04 21:52:34.085404 [ 281 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, e.displayText() = Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1 |
clickhouse-1 | 0. Poco::Net::SocketImpl::error(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x13c4ee8e in /usr/bin/clickhouse
clickhouse-1 | 1. Poco::Net::SocketImpl::peerAddress() @ 0x13c510d6 in /usr/bin/clickhouse
clickhouse-1 | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x101540cd in /usr/bin/clickhouse
clickhouse-1 | 3. DB::HTTPServerRequest::HTTPServerRequest(std::__1::shared_ptr<DB::Context const>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x110e6fd5 in /usr/bin/clickhouse
clickhouse-1 | 4. DB::HTTPServerConnection::run() @ 0x110e5d6e in /usr/bin/clickhouse
clickhouse-1 | 5. Poco::Net::TCPServerConnection::start() @ 0x13c5614f in /usr/bin/clickhouse
clickhouse-1 | 6. Poco::Net::TCPServerDispatcher::run() @ 0x13c57bda in /usr/bin/clickhouse
clickhouse-1 | 7. Poco::PooledThread::run() @ 0x13d89e59 in /usr/bin/clickhouse
clickhouse-1 | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x13d860ea in /usr/bin/clickhouse
clickhouse-1 | 9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
clickhouse-1 | 10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
```
It is not clear to me at all why this started happening. Our instance has run for months without incident and there have been no changes I am aware of. What could cause it to lose connection to clickhouse?
@csvan Have you updated your install recently?
I'm not sure what happened, but after updating to version 24.4.2 everything SEEMS to be working fine; I no longer have 100% of errors dropped. I didn't change anything on our server.
~~Same errors here (24.4.0 and nightly), though it seems it does not affect ingestion or Sentry's general working status~~
Sorry, I just discovered https://github.com/getsentry/self-hosted/issues/2978. Migrating back to plain `consumer` instead of `rust-consumer` fixed the log spam.
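For anyone else landing here, the workaround from that issue boils down to switching the Snuba consumers in docker-compose.yml from the Rust implementation back to the Python one. A rough sketch (exact service and command names depend on your self-hosted version):

```bash
# See which services are started with the Rust consumer
grep -n 'rust-consumer' docker-compose.yml

# Edit each matching service's command so it uses "consumer" instead of "rust-consumer",
# e.g. "rust-consumer --storage errors ..." becomes "consumer --storage errors ...",
# then recreate the affected containers
docker compose up -d --force-recreate
```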
I also ran into this situation, where 100% of issues were dropped, when I upgraded 23.11.2 -> 24.3.0. After I upgraded 24.3.0 -> 24.5.0, everything seems to be normal according to the stats page.
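For anyone on an affected version, the upgrade itself is roughly: check out the newer release tag in your getsentry/self-hosted checkout and re-run the installer (a sketch; check the release notes in case a given version has extra steps):

```bash
cd /path/to/self-hosted   # wherever your getsentry/self-hosted checkout lives
git fetch --tags
git checkout 24.5.0
./install.sh
docker compose up -d      # newer versions may suggest `docker compose up --wait` instead
```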