Relay Server: Not Enough Memory on Health Check even though stats show otherwise
Self-Hosted Version
24.8.0
CPU Architecture
x86_64
Docker Version
27.2.1
Docker Compose Version
2.29.2
Steps to Reproduce
Can't really tell how to reproduce, since it just happens out of nowhere.
Expected Result
Sentry receives errors again
Actual Result
Sentry stops receiving errors after 1-2 days of normal usage.
Checking the Docker logs, there are a lot of these entries present:
relay-1 | 2024-09-17T06:33:00.437945Z ERROR relay_server::services::health_check: Not enough memory, 32351698944 / 33568419840 (96.38% >= 95.00%)
relay-1 | 2024-09-17T06:33:00.437982Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed
but checking with htop we have enough RAM,
and checking docker container stats there is no container using > 95% RAM:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
48244b39b406 sentry-self-hosted-nginx-1 0.10% 11.74MiB / 31.26GiB 0.04% 829MB / 832MB 11.8MB / 131kB 13
4188cd46c1fe sentry-self-hosted-relay-1 0.54% 515.6MiB / 31.26GiB 1.61% 896MB / 2.06GB 363MB / 284MB 40
ca96eedda64d sentry-self-hosted-generic-metrics-consumer-1 0.56% 342.4MiB / 31.26GiB 1.07% 177MB / 290MB 17.8MB / 82MB 21
5dedcba8621c sentry-self-hosted-monitors-clock-tick-1 0.28% 162.4MiB / 31.26GiB 0.51% 36MB / 33.5MB 35.8MB / 29.5MB 6
8e98edce4698 sentry-self-hosted-subscription-consumer-generic-metrics-1 0.28% 323.9MiB / 31.26GiB 1.01% 37.7MB / 34.5MB 13.2MB / 68.7MB 13
ca542ffd958a sentry-self-hosted-attachments-consumer-1 0.48% 500.2MiB / 31.26GiB 1.56% 16.9MB / 15.2MB 26.8MB / 66.6MB 19
b300b32205a5 sentry-self-hosted-snuba-replacer-1 0.28% 115.6MiB / 31.26GiB 0.36% 35MB / 31.5MB 20.2MB / 68.9MB 5
1a77958a745c sentry-self-hosted-ingest-monitors-1 0.37% 169.8MiB / 31.26GiB 0.53% 17.2MB / 15.8MB 41.3MB / 19.8MB 11
b089c100ada2 sentry-self-hosted-worker-1 5.47% 1.444GiB / 31.26GiB 4.62% 9.41GB / 13.3GB 238MB / 126MB 227
8a69077b8025 sentry-self-hosted-snuba-replays-consumer-1 0.46% 157.2MiB / 31.26GiB 0.49% 35.4MB / 31.9MB 2.92MB / 123MB 30
d77331751542 sentry-self-hosted-events-consumer-1 0.37% 281.7MiB / 31.26GiB 0.88% 974MB / 1.02GB 18.1MB / 114MB 17
9414826f3339 sentry-self-hosted-subscription-consumer-transactions-1 0.26% 258.9MiB / 31.26GiB 0.81% 37.3MB / 34MB 27.9MB / 137MB 13
e59e3a288fbf sentry-self-hosted-vroom-1 0.00% 11.75MiB / 31.26GiB 0.04% 233kB / 0B 34.8MB / 4.35MB 11
0acb817cf62d sentry-self-hosted-snuba-subscription-consumer-events-1 0.39% 147.5MiB / 31.26GiB 0.46% 40.9MB / 34.8MB 7.13MB / 23.4MB 9
f996440e6e0d sentry-self-hosted-post-process-forwarder-issue-platform-1 0.57% 274.9MiB / 31.26GiB 0.86% 71.8MB / 65.6MB 22.7MB / 121MB 18
c9578898e6f1 sentry-self-hosted-sentry-cleanup-1 0.00% 7.328MiB / 31.26GiB 0.02% 260kB / 28.5kB 136MB / 557kB 6
c148234ce38f sentry-self-hosted-subscription-consumer-events-1 0.27% 226.4MiB / 31.26GiB 0.71% 36.4MB / 33.1MB 16.8MB / 171MB 13
4651f280662d sentry-self-hosted-metrics-consumer-1 0.50% 339.4MiB / 31.26GiB 1.06% 34.8MB / 31.4MB 21.2MB / 65.4MB 21
91bc0382285e sentry-self-hosted-ingest-profiles-1 0.27% 155.3MiB / 31.26GiB 0.49% 33.5MB / 30MB 15.7MB / 37MB 6
82f8557caafa sentry-self-hosted-snuba-metrics-consumer-1 0.55% 191.5MiB / 31.26GiB 0.60% 34.6MB / 31.4MB 2.88MB / 96.5MB 34
1c7c176be51e sentry-self-hosted-transactions-consumer-1 0.40% 347.4MiB / 31.26GiB 1.09% 30.6MB / 29.6MB 26.3MB / 42.5MB 17
7f04ed421f85 sentry-self-hosted-post-process-forwarder-errors-1 0.69% 350MiB / 31.26GiB 1.09% 32.8MB / 21.3MB 18.2MB / 43.4MB 23
d8ad278dcd0a sentry-self-hosted-ingest-occurrences-1 0.52% 146.3MiB / 31.26GiB 0.46% 36.9MB / 33.3MB 31.5MB / 58.7MB 16
a77658238b91 sentry-self-hosted-snuba-subscription-consumer-metrics-1 0.39% 132.6MiB / 31.26GiB 0.41% 36.2MB / 33.2MB 9.47MB / 41.3MB 9
8c9d85e099b8 sentry-self-hosted-web-1 0.12% 734.6MiB / 31.26GiB 2.29% 90.4MB / 332MB 346MB / 194MB 41
74ad5875b983 sentry-self-hosted-monitors-clock-tasks-1 0.25% 147.3MiB / 31.26GiB 0.46% 35.3MB / 31.9MB 16.9MB / 49MB 6
12994ef232b6 sentry-self-hosted-billing-metrics-consumer-1 0.46% 157.6MiB / 31.26GiB 0.49% 63.6MB / 37.2MB 13.5MB / 36MB 9
f7f6a96c18ec sentry-self-hosted-ingest-replay-recordings-1 0.43% 159.8MiB / 31.26GiB 0.50% 36.2MB / 32.7MB 21.8MB / 35.5MB 13
c0d724db862f sentry-self-hosted-snuba-issue-occurrence-consumer-1 0.55% 333.4MiB / 31.26GiB 1.04% 34.8MB / 31.3MB 25.1MB / 61.4MB 41
37fb6a1dbf5e sentry-self-hosted-cron-1 0.00% 179.1MiB / 31.26GiB 0.56% 17.4MB / 146MB 28.9MB / 45.1MB 3
6c46a69fa981 sentry-self-hosted-snuba-outcomes-billing-consumer-1 0.35% 200.9MiB / 31.26GiB 0.63% 18.3MB / 16.5MB 11MB / 49.2MB 26
564d6cedc04a sentry-self-hosted-post-process-forwarder-transactions-1 0.64% 397.1MiB / 31.26GiB 1.24% 5.76GB / 462MB 21.2MB / 104MB 23
f135db607aec sentry-self-hosted-subscription-consumer-metrics-1 0.28% 298.9MiB / 31.26GiB 0.93% 36.5MB / 33.3MB 19.1MB / 94MB 13
c3ebde813802 sentry-self-hosted-snuba-transactions-consumer-1 0.47% 322.3MiB / 31.26GiB 1.01% 21.4MB / 17MB 21.4MB / 60.4MB 37
4e4475215c07 sentry-self-hosted-ingest-feedback-events-1 0.39% 238.4MiB / 31.26GiB 0.74% 35.7MB / 32.1MB 12.8MB / 156MB 15
d266f38dc1aa sentry-self-hosted-snuba-spans-consumer-1 0.35% 179MiB / 31.26GiB 0.56% 142MB / 582MB 11.3MB / 101MB 26
756587f4a091 sentry-self-hosted-snuba-generic-metrics-gauges-consumer-1 0.54% 181.1MiB / 31.26GiB 0.57% 117MB / 37.3MB 3.19MB / 120MB 34
ea958e5724d3 sentry-self-hosted-snuba-profiling-profiles-consumer-1 0.34% 128.8MiB / 31.26GiB 0.40% 35MB / 31.5MB 2.27MB / 120MB 26
cc77019cca49 sentry-self-hosted-symbolicator-cleanup-1 0.00% 4.367MiB / 31.26GiB 0.01% 234kB / 0B 35.2MB / 0B 6
8c1ca1507b75 sentry-self-hosted-snuba-profiling-functions-consumer-1 0.34% 134.2MiB / 31.26GiB 0.42% 35MB / 31.5MB 2.53MB / 116MB 26
c479e6353a01 sentry-self-hosted-snuba-generic-metrics-counters-consumer-1 0.57% 255.2MiB / 31.26GiB 0.80% 18MB / 16.3MB 10.8MB / 23.2MB 34
a64b4393273e sentry-self-hosted-snuba-errors-consumer-1 0.57% 233.1MiB / 31.26GiB 0.73% 360MB / 317MB 4.6MB / 57.9MB 34
475833b0af89 sentry-self-hosted-snuba-generic-metrics-sets-consumer-1 0.67% 213.9MiB / 31.26GiB 0.67% 118MB / 41.6MB 6.73MB / 95.7MB 34
8a412f926058 sentry-self-hosted-snuba-generic-metrics-distributions-consumer-1 0.56% 270.6MiB / 31.26GiB 0.85% 124MB / 559MB 4.73MB / 60.4MB 34
4be03fe74a97 sentry-self-hosted-snuba-outcomes-consumer-1 0.33% 162.1MiB / 31.26GiB 0.51% 33.8MB / 30.4MB 2.17MB / 92.4MB 26
f9689ba0b412 sentry-self-hosted-snuba-group-attributes-consumer-1 0.47% 322MiB / 31.26GiB 1.01% 34.8MB / 31.7MB 17.9MB / 86.7MB 37
acb9904b4aa4 sentry-self-hosted-snuba-subscription-consumer-transactions-1 0.38% 123.5MiB / 31.26GiB 0.39% 42.5MB / 36.3MB 8.68MB / 41.4MB 9
c44eb35c0248 sentry-self-hosted-vroom-cleanup-1 0.00% 3.602MiB / 31.26GiB 0.01% 234kB / 0B 8.87MB / 0B 6
e896238b7399 sentry-self-hosted-memcached-1 0.03% 23.07MiB / 31.26GiB 0.07% 673MB / 1.81GB 8.15MB / 2.45MB 10
f5170f49f238 sentry-self-hosted-snuba-api-1 0.05% 113.7MiB / 31.26GiB 0.36% 9.83MB / 16.3MB 66.3MB / 68.9MB 5
f7f090a153a3 sentry-self-hosted-symbolicator-1 0.00% 35MiB / 31.26GiB 0.11% 307kB / 59.4kB 25.5MB / 142MB 38
dec490aea24f sentry-self-hosted-smtp-1 0.00% 1.371MiB / 31.26GiB 0.00% 255kB / 15.7kB 28.8MB / 4.1kB 2
3e57a3611024 sentry-self-hosted-postgres-1 0.01% 231.6MiB / 31.26GiB 0.72% 1.53GB / 728MB 17.6GB / 13.2MB 53
c5eaa992ea9b sentry-self-hosted-kafka-1 1.97% 1.257GiB / 31.26GiB 4.02% 5.5GB / 11.8GB 1.73GB / 461MB 111
4bac36975fe0 sentry-self-hosted-clickhouse-1 0.39% 469MiB / 31.26GiB 1.47% 1.71GB / 83.7MB 3.21GB / 93.3MB 481
f1f35f95d7a7 sentry-self-hosted-redis-1 0.17% 56.23MiB / 31.26GiB 0.18% 11.9GB / 7.22GB 763MB / 2.97MB 5
docker_compose_logs.txt latest_install_logs.txt
Event ID
No response
Duplicate of https://github.com/getsentry/self-hosted/issues/3327
I think it's a memory leak.
What I can confirm is that upgrading to 24.9.0 did NOT fix the issue. After a few hours of incoming events I repeatedly get the same
relay-1 | 2024-09-17T14:16:40.186772Z ERROR relay_server::services::health_check: Not enough memory, 32202633216 / 33568419840 (95.93% >= 95.00%)
relay-1 | 2024-09-17T14:16:40.186811Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed
relay-1 | 2024-09-17T14:16:40.449995Z ERROR relay_server::endpoints::common: error handling request error=failed to queue envelope
error in the docker compose logs.
What does the event volume look like for you? Did this start happening after upgrading to 24.8.0?
We did the 24.8.0 Sentry update on the 1st of September.
This is our stats page for the last 30 days
As you can see, there are sections where it works fine, but then sometimes for a few hours, sometimes even for days, no events are processed.
Could you track your RAM/CPU usage as well? Wondering if there is a correlation there.
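Even something as simple as a cron one-liner that appends snapshots to a log file would do — just a sketch, assuming free and docker are available on the host:
# append a timestamped snapshot of host memory and per-container usage on every run
{ date -Is; free -b; docker stats --no-stream --format '{{.Name}} {{.MemUsage}} {{.MemPerc}} {{.CPUPerc}}'; echo; } >> /var/log/sentry-usage.log
Run it every minute and you can later correlate the timestamps with the relay errors.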
I can also see errors related to https://github.com/getsentry/snuba/issues/5707 in my logs
postgres-1 | 2024-09-20 11:04:34.677 UTC [660670] ERROR: duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1 | 2024-09-20 11:04:34.677 UTC [660670] DETAIL: Key (project_id, environment_id)=(76, 1) already exists.
postgres-1 | 2024-09-20 11:04:34.677 UTC [660670] STATEMENT: INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (76, 1, NULL) RETURNING "sentry_environmentproject"."id"
postgres-1 | 2024-09-20 11:04:34.692 UTC [660670] ERROR: duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1 | 2024-09-20 11:04:34.692 UTC [660670] DETAIL: Key (group_id, release_id, environment)=(385, 413, production) already exists.
postgres-1 | 2024-09-20 11:04:34.692 UTC [660670] STATEMENT: INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 385, 413, 'production', '2024-09-20T11:04:33.517322+00:00'::timestamptz, '2024-09-20T11:04:33.517322+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
postgres-1 | 2024-09-20 11:04:34.699 UTC [660673] ERROR: duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1 | 2024-09-20 11:04:34.699 UTC [660673] DETAIL: Key (group_id, release_id, environment)=(6909, 413, production) already exists.
postgres-1 | 2024-09-20 11:04:34.699 UTC [660673] STATEMENT: INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 6909, 413, 'production', '2024-09-20T11:04:33.688391+00:00'::timestamptz, '2024-09-20T11:04:33.688391+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/49/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
postgres-1 | 2024-09-20 11:04:39.833 UTC [660680] ERROR: duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1 | 2024-09-20 11:04:39.833 UTC [660680] DETAIL: Key (project_id, environment_id)=(49, 14) already exists.
postgres-1 | 2024-09-20 11:04:39.833 UTC [660680] STATEMENT: INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (49, 14, NULL) RETURNING "sentry_environmentproject"."id"
clickhouse-1 | 2024.09.20 11:04:40.173549 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1 |
clickhouse-1 | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1 | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1 | 2. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x0000000013154417 in /usr/bin/clickhouse
clickhouse-1 | 3. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1 | 4. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1 | 5. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1 | 6. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1 | 7. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1 | 8. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1 | 9. ? @ 0x00007f2c86de5353 in ?
clickhouse-1 | (version 23.8.11.29.altinitystable (altinity build))
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:46 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1 | 144.208.193.56 - - [20/Sep/2024:11:04:48 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
clickhouse-1 | 2024.09.20 11:04:48.941366 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1 |
clickhouse-1 | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1 | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1 | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1 | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1 | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1 | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1 | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1 | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1 | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1 | 9. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1 | 10. ? @ 0x00007f2c86de5353 in ?
clickhouse-1 | (version 23.8.11.29.altinitystable (altinity build))
I will have to get some system metrics monitoring running to give you the requested info.
Hey @LordSimal can you try this:
On your relay/config.yml file (https://github.com/getsentry/self-hosted/blob/master/relay/config.example.yml) add a health section, so it'd be:
relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
logging:
  level: WARN
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
health:
  max_memory_percent: 1.0
Then run sudo docker compose up -d relay (or sudo docker compose --env-file .env.custom up -d relay). If things don't change, try restarting the relay container.
Thanks to @Dav1dde
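To confirm the new setting took effect after the restart, you can watch the relay logs for the memory probe — a quick sketch, assuming the compose service is named relay as in the default setup:
docker compose logs --since 30m relay | grep -i health_check
If the setting is applied, the "Not enough memory" lines should stop showing up.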
Interesting. If you don't mind, could you elaborate on the changes? Is it the same as setting resource limits on each container?
Just wanna post my current stats of the last 3 days before I do this change
The 21st is not shown in the stats here for some reason, but there haven't been any events since yesterday at 1 PM (1 day 7 hours).
I don't really understand the RAM usage graph here, since htop says there is only 13.7 GB used of 31.3 GB.
But maybe NGINX Amplify uses a different RAM usage metric than htop.
Do we really have a RAM usage problem here? Is 32 GB not enough for Sentry? This worked fine in older versions on exactly this server.
Adjusted the relay/config.yml and executed docker compose up -d relay to restart the relay container.
Nothing has changed so far, even though there are definitely events that should be coming in.
Should I try to just run ./install.sh again to do a "fresh restart"? I think this worked in the past.
What is the java process?
It's the Kafka process:
root@scarecrow:~# ps -ax | grep java
1969 pts/0 S+ 0:00 grep java
17703 ? Ssl 183:39 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-telemetry/* kafka.Kafka /etc/kafka/kafka.properties
I got some news.... I just executed
docker compose down
./install.sh
docker compose up -d
and suddenly the stats page updated and there are events present which were not there previously...
Also, events are being processed right now and my server is pinned at 100% usage.
Seems like something prevented the queue worker from processing the queued events.
After around ~15 minutes all queued-up events seem to have been processed and the load is normal again. New Sentry events are also showing up in the UI pretty much instantly, as they did in the past.
Will now wait and see if the problem recurs.
To answer the earlier question: max_memory_percent: 1.0 effectively disables the memory check; it would only fail if 100% of the OS memory is used, at which point the OOM killer would already be kicking in.
Created a Relay issue to investigate why Relay seems to think more memory is used than it actually is: https://github.com/getsentry/relay/issues/4059
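If it helps the investigation, here is a rough way to compare what the kernel reports with the used/total byte counts Relay prints in the health_check line — just a sketch, run on the host:
free -b
grep -E 'MemTotal|MemFree|MemAvailable|^Cached|Buffers' /proc/meminfo
A large gap between MemFree and MemAvailable would hint that the ~96% figure includes reclaimable page cache rather than memory that is actually unavailable.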
So far everything is working fine again.
I do indeed have the relay config set as you requested and no memory errors have been reported yet by the relay container.
root@scarecrow:~/sentry# docker compose logs -n 1000 relay
relay-1 | 2024-09-22T09:08:52.572625Z ERROR relay_server::services::health_check: Health check probe 'auth' failed
relay-1 | 2024-09-22T11:17:56.187838Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), client error (Connect), operation timed out] attempts=2
relay-1 | 2024-09-22T11:18:02.993599Z WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1 | 2024-09-22T11:18:02.993674Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=3
relay-1 | 2024-09-22T11:33:29.321590Z WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1 | 2024-09-22T11:33:34.361859Z WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1 | 2024-09-22T15:43:05.212806Z ERROR relay_server::services::project_cache: failed to fetch project from Redis error=failed to talk to redis error.sources=[failed to communicate with redis, Resource temporarily unavailable (os error 11)]
relay-1 | 2024-09-23T00:00:42.685558Z WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1 | 2024-09-23T00:00:47.760235Z WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1 | 2024-09-23T00:00:53.778282Z WARN relay_server::services::upstream: network outage, scheduling another check in 1.5s
relay-1 | 2024-09-23T00:01:26.554169Z WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1 | 2024-09-23T00:01:41.630586Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1 | 2024-09-23T00:49:21.795499Z WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1 | 2024-09-23T00:49:26.810505Z WARN relay_server::services::upstream: network outage, scheduling another check in 1s
Great, thank you for the explanation.
So it's been nearly 24h of uninterrupted, working Sentry... BUT it happened again. No events have been processed for the last 10 hours.
As you can see, this is sorted by Last seen.
Here is my server stats report from the last 24h
And here is the output of docker compose logs --since 12h > 12h-logs.txt
At the end of that file there are just a bunch of
clickhouse-1 | 2024.09.23 11:24:50.987265 [ 65743 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1 |
clickhouse-1 | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1 | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1 | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1 | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1 | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1 | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1 | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1 | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1 | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1 | 9. ? @ 0x00007f10b149a609 in ?
clickhouse-1 | 10. ? @ 0x00007f10b13bf353 in ?
clickhouse-1 | (version 23.8.11.29.altinitystable (altinity build))
I have a theory that the Java process may not be running garbage collection even though RAM usage is high, which may indirectly prevent Sentry from allocating the memory it wants.
Well, something inside Sentry changed with 24.8.0 to cause this. Can I downgrade Sentry to 24.7.1 to test this?
@LordSimal By any chance, do you have a swapfile configured? If so, how many GB are allocated for swap? I can only see that 32 GB is allocated as regular RAM.
The ClickHouse logs are not an issue; they're just saying that the connection was closed prematurely by the client. Nothing harmful there (although if you do centralized logging and ship every syslog line to that server, then yes, you'll have a disk space problem on your centralized logging server). The issue tracking this is: https://github.com/getsentry/snuba/issues/5707
However, from your logs I'm seeing something weird related to your Kafka:
kafka-1 | [2024-09-23 07:42:35,763] WARN [GroupCoordinator 1001]: Failed to write empty metadata for group snuba-spans-consumers: This is not the correct coordinator. (kafka.coordinator.group.GroupCoordinator)
kafka-1 | [2024-09-23 07:42:35,763] WARN [GroupCoordinator 1001]: Failed to write empty metadata for group snuba-consumers: This is not the correct coordinator. (kafka.coordinator.group.GroupCoordinator)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
events-consumer-1 | return ctx.invoke(self.callback, **ctx.params)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1 | return __callback(*args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
events-consumer-1 | return f(get_current_context(), *args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/usr/src/sentry/src/sentry/runner/decorators.py", line 83, in inner
events-consumer-1 | return ctx.invoke(f, *args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1 | return __callback(*args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
events-consumer-1 | return f(get_current_context(), *args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/usr/src/sentry/src/sentry/runner/decorators.py", line 35, in inner
events-consumer-1 | return ctx.invoke(f, *args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1 | return __callback(*args, **kwargs)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/usr/src/sentry/src/sentry/runner/commands/run.py", line 386, in basic_consumer
events-consumer-1 | run_processor_with_signals(processor, consumer_name)
events-consumer-1 | File "/usr/src/sentry/src/sentry/utils/kafka.py", line 46, in run_processor_with_signals
events-consumer-1 | processor.run()
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 322, in run
events-consumer-1 | self._run_once()
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 410, in _run_once
events-consumer-1 | self.__processing_strategy.submit(message)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 82, in submit
events-consumer-1 | self.__inner_strategy.submit(message)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/run_task.py", line 52, in submit
events-consumer-1 | self.__next_step.submit(Message(value))
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 34, in submit
events-consumer-1 | self.__next_step.submit(message)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 82, in submit
events-consumer-1 | self.__inner_strategy.submit(message)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/run_task.py", line 52, in submit
events-consumer-1 | self.__next_step.submit(Message(value))
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 34, in submit
events-consumer-1 | self.__next_step.submit(message)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/commit.py", line 34, in submit
events-consumer-1 | self.__commit(message.committable)
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 308, in __commit
events-consumer-1 | self.__consumer.commit_offsets()
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/backends/kafka/consumer.py", line 609, in commit_offsets
events-consumer-1 | return self.__commit_retry_policy.call(self.__commit)
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/utils/retries.py", line 88, in call
events-consumer-1 | return callable()
events-consumer-1 | ^^^^^^^^^^
events-consumer-1 | File "/.venv/lib/python3.12/site-packages/arroyo/backends/kafka/consumer.py", line 567, in __commit
events-consumer-1 | result = self.__consumer.commit(
events-consumer-1 | ^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1 | cimpl.KafkaException: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="Commit failed: Broker: Coordinator load in progress"}
The events-consumer container is a Sentry consumer written in Python on top of Django. But judging from the logs, the problem may lie in the internal connection to your Kafka. Perhaps you can increase the nofile limits here: https://github.com/getsentry/self-hosted/blob/5bd6cd3710cc214b2d68858d24a7d6bcf8149d73/docker-compose.yml#L158-L161
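For reference, a quick way to check what limits the containers actually get before and after changing those values — a sketch, assuming the default compose service names:
docker compose exec kafka bash -c 'ulimit -n -u'            # open files / max processes inside the Kafka container
docker compose exec events-consumer bash -c 'ulimit -n -u'  # same check for the consumer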
Or you can migrate your Kafka to Redpanda; it's a drop-in replacement, and all you'll need to do is re-run ./install.sh.
Is it officially supported? Are there any env variables we should change to implement it?
Regards, Baskoro
I have adjusted the soft and hard ulimit values to 8192 and executed ./install.sh
Again, the queue worker goes ham and processes everything that has been queued up.
We will see what the next 24h bring.
Regarding the swap question: yes, we do have plenty of swap available, as you can see in my htop screenshots above (64 GB).
Let me know if I should create a new issue, but I have almost the same situation.
Self-Hosted Version 24.8.0
CPU Architecture x86_64
Docker Version 26.0.0
Docker Compose Version 2.25.0
For me, Sentry stopped processing issues with version 24.5.0 a couple of months ago. When I restart it, it works for a couple of days and then stops processing again. Here is an example from last Saturday, when it failed over the weekend.
In docker compose logs, I can see lines like these:
relay-1 | 2024-09-23T15:24:03.996749240Z 2024-09-23T15:24:03.996335Z ERROR relay_server::services::health_check: Not enough memory, 16059187200 / 16773009408 (95.74% >= 95.00%)
relay-1 | 2024-09-23T15:24:06.999942277Z 2024-09-23T15:24:06.997660Z ERROR relay_server::services::health_check: Not enough memory, 15983144960 / 16773009408 (95.29% >= 95.00%)
relay-1 | 2024-09-23T15:24:58.060185500Z 2024-09-23T15:24:58.059995Z ERROR relay_server::services::health_check: Not enough memory, 16002109440 / 16773009408 (95.40% >= 95.00%)
But when I check the memory with htop, about 8 of 16 GB of RAM is used. I also have 16 GB of swap, and its usage varies between 0 and 10 GB.
Yesterday I got it processing again by restarting the whole server. After that I upgraded 24.5.0 -> 24.8.0.
For the past couple of months we have been monitoring Kafka, as there seems to be lag every time we have problems with processing. The one-liner used for monitoring:
docker exec sentry-self-hosted-kafka-1 kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group post-process-forwarder | awk -v ts=$(date +%s) 'NR > 1 {print $2 "," $6 "," ts}' | grep -v -e "TOPIC\|generic-events" >> /tmp/kafka_lag_report.csv
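For completeness, a variant of the same one-liner that reports lag across every consumer group at once — just a sketch; the --all-groups flag needs a reasonably recent Kafka CLI:
docker exec sentry-self-hosted-kafka-1 kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups | awk '$6 ~ /^[0-9]+$/ && $6 > 0 {print $1, $2, $3, $6}'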
I'm not sure if it's related, but in the PostgreSQL container logs I see these:
worker-1 | 2024-09-23T11:55:05.254675423Z sentry.models.environment.Environment.MultipleObjectsReturned: get() returned more than one Environment -- it returned 2!
And the worker logs are full of these:
worker-1 | 00:38:17 [ERROR] celery.app.trace: Task sentry.tasks.store.save_event_transaction[865f23da-0dfe-40c7-b360-63152829cf95] raised unexpected: MultipleObjectsReturned('get() returned more than one Environment -- it returned 2!') (data={'hostname': 'celery@5aab0274fd34', 'id': '865f23da-0dfe-40c7-b360-63152829cf95', 'name': 'sentry.tasks.store.save_event_transaction', 'exc': "MultipleObjectsReturned('get() returned more than one Environment -- it returned 2!')", 'traceback': 'Traceback (most recent call last):\n File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 477, in trace_task\n R = retval = fun(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/utils.py", line 1720, in runner\n return sentry_patched_function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery/__init__.py", line 406, in _inner\n reraise(*exc_info)\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/utils.py", line 1649, in reraise\n raise value\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery/__init__.py", line 401, in _inner\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 760, in __protected_call__\n return self.run(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n return original_method(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/tasks/base.py", line 128, in _wrapped\n result = func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/tasks/store.py", line 678, in save_event_transaction\n _do_save_event(cache_key, data, start_time, event_id, project_id, **kwargs)\n File "/usr/src/sentry/src/sentry/tasks/store.py", line 554, in _do_save_event\n manager.save(\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/event_manager.py", line 502, in save\n jobs = save_transaction_events([job], projects)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/event_manager.py", line 3065, in save_transaction_events\n _get_or_create_environment_many(jobs, projects)\n File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/event_manager.py", line 977, in _get_or_create_environment_many\n job["environment"] = Environment.get_or_create(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/models/environment.py", line 98, in get_or_create\n env = cls.objects.get_or_create(name=name, organization_id=project.organization_id)[\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n return original_method(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 87, in manager_method\n return getattr(self.get_queryset(), name)(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n return original_method(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 948, in get_or_create\n return self.get(**kwargs), False\n ^^^^^^^^^^^^^^^^^^\n File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 652, in get\n raise self.model.MultipleObjectsReturned(\nsentry.models.environment.Environment.MultipleObjectsReturned: get() returned more than one Environment -- it returned 2!\n', 'args': '()', 'kwargs': "{'cache_key': 'e:a8db55e3cc9149168cc68a3b81ab5c44:40', 'data': None, 'start_time': 1727138295.0, 'event_id': 'a8db55e3cc9149168cc68a3b81ab5c44', 'project_id': 40, '__start_time': 1727138297.471802}", 'description': 'raised unexpected', 'internal': False})
I listed the duplicated environments with this script and there were many of them. I couldn't merge them with the other script; the latter just gives me this error:
raise TransactionMissingDBException("'using' must be specified when creating a transaction")
sentry.silo.patches.silo_aware_transaction_patch.TransactionMissingDBException: 'using' must be specified when creating a transaction
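For reference, a minimal way to list the duplicated environments directly in Postgres, without the scripts — a sketch assuming the default self-hosted credentials and the standard sentry_environment table:
docker compose exec postgres psql -U postgres -c "SELECT organization_id, name, count(*) FROM sentry_environment GROUP BY organization_id, name HAVING count(*) > 1;"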
I haven't yet changed max_memory_percent or ulimit. Is there anything else I could try or shall I just repeat the same steps as LordSimal?
Adjusting the ulimits seems to have only increased the duration of working Sentry from 1 day to 2 days... It broke again today at 02:00 CEST.
Here are my logs from the last 12 hours (it happened 7h 40min ago): 12h_logs.txt.gz
Will execute ./install.sh to get it working again.
I still think the problem is the Java process. Can you restart the Java process when Sentry breaks?
So I should only restart the Kafka container, is what you're saying?
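(Presumably something like the following, run from the self-hosted directory — just a sketch:)
docker compose restart kafka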
Yeah