
Relay Server: Not Enough Memory on Health Check even though stats show otherwise

Open LordSimal opened this issue 1 year ago • 45 comments

Self-Hosted Version

24.8.0

CPU Architecture

x86_64

Docker Version

27.2.1

Docker Compose Version

2.29.2

Steps to Reproduce

Can't really tell how to reproduce, since it just happens out of nowhere.

Expected Result

Sentry receives errors again

Actual Result

Sentry stops receiving errors after 1-2 days of normal usage.

Checking the docker logs, there are a lot of these entries present:

relay-1                                         | 2024-09-17T06:33:00.437945Z ERROR relay_server::services::health_check: Not enough memory, 32351698944 / 33568419840 (96.38% >= 95.00%)
relay-1                                         | 2024-09-17T06:33:00.437982Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed

but checking htop we can see that we have enough free RAM:

Image

and checking docker container stats, no container is using > 95% RAM:

CONTAINER ID   NAME                                                                CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
48244b39b406   sentry-self-hosted-nginx-1                                          0.10%     11.74MiB / 31.26GiB   0.04%     829MB / 832MB     11.8MB / 131kB    13
4188cd46c1fe   sentry-self-hosted-relay-1                                          0.54%     515.6MiB / 31.26GiB   1.61%     896MB / 2.06GB    363MB / 284MB     40
ca96eedda64d   sentry-self-hosted-generic-metrics-consumer-1                       0.56%     342.4MiB / 31.26GiB   1.07%     177MB / 290MB     17.8MB / 82MB     21
5dedcba8621c   sentry-self-hosted-monitors-clock-tick-1                            0.28%     162.4MiB / 31.26GiB   0.51%     36MB / 33.5MB     35.8MB / 29.5MB   6
8e98edce4698   sentry-self-hosted-subscription-consumer-generic-metrics-1          0.28%     323.9MiB / 31.26GiB   1.01%     37.7MB / 34.5MB   13.2MB / 68.7MB   13
ca542ffd958a   sentry-self-hosted-attachments-consumer-1                           0.48%     500.2MiB / 31.26GiB   1.56%     16.9MB / 15.2MB   26.8MB / 66.6MB   19
b300b32205a5   sentry-self-hosted-snuba-replacer-1                                 0.28%     115.6MiB / 31.26GiB   0.36%     35MB / 31.5MB     20.2MB / 68.9MB   5
1a77958a745c   sentry-self-hosted-ingest-monitors-1                                0.37%     169.8MiB / 31.26GiB   0.53%     17.2MB / 15.8MB   41.3MB / 19.8MB   11
b089c100ada2   sentry-self-hosted-worker-1                                         5.47%     1.444GiB / 31.26GiB   4.62%     9.41GB / 13.3GB   238MB / 126MB     227
8a69077b8025   sentry-self-hosted-snuba-replays-consumer-1                         0.46%     157.2MiB / 31.26GiB   0.49%     35.4MB / 31.9MB   2.92MB / 123MB    30
d77331751542   sentry-self-hosted-events-consumer-1                                0.37%     281.7MiB / 31.26GiB   0.88%     974MB / 1.02GB    18.1MB / 114MB    17
9414826f3339   sentry-self-hosted-subscription-consumer-transactions-1             0.26%     258.9MiB / 31.26GiB   0.81%     37.3MB / 34MB     27.9MB / 137MB    13
e59e3a288fbf   sentry-self-hosted-vroom-1                                          0.00%     11.75MiB / 31.26GiB   0.04%     233kB / 0B        34.8MB / 4.35MB   11
0acb817cf62d   sentry-self-hosted-snuba-subscription-consumer-events-1             0.39%     147.5MiB / 31.26GiB   0.46%     40.9MB / 34.8MB   7.13MB / 23.4MB   9
f996440e6e0d   sentry-self-hosted-post-process-forwarder-issue-platform-1          0.57%     274.9MiB / 31.26GiB   0.86%     71.8MB / 65.6MB   22.7MB / 121MB    18
c9578898e6f1   sentry-self-hosted-sentry-cleanup-1                                 0.00%     7.328MiB / 31.26GiB   0.02%     260kB / 28.5kB    136MB / 557kB     6
c148234ce38f   sentry-self-hosted-subscription-consumer-events-1                   0.27%     226.4MiB / 31.26GiB   0.71%     36.4MB / 33.1MB   16.8MB / 171MB    13
4651f280662d   sentry-self-hosted-metrics-consumer-1                               0.50%     339.4MiB / 31.26GiB   1.06%     34.8MB / 31.4MB   21.2MB / 65.4MB   21
91bc0382285e   sentry-self-hosted-ingest-profiles-1                                0.27%     155.3MiB / 31.26GiB   0.49%     33.5MB / 30MB     15.7MB / 37MB     6
82f8557caafa   sentry-self-hosted-snuba-metrics-consumer-1                         0.55%     191.5MiB / 31.26GiB   0.60%     34.6MB / 31.4MB   2.88MB / 96.5MB   34
1c7c176be51e   sentry-self-hosted-transactions-consumer-1                          0.40%     347.4MiB / 31.26GiB   1.09%     30.6MB / 29.6MB   26.3MB / 42.5MB   17
7f04ed421f85   sentry-self-hosted-post-process-forwarder-errors-1                  0.69%     350MiB / 31.26GiB     1.09%     32.8MB / 21.3MB   18.2MB / 43.4MB   23
d8ad278dcd0a   sentry-self-hosted-ingest-occurrences-1                             0.52%     146.3MiB / 31.26GiB   0.46%     36.9MB / 33.3MB   31.5MB / 58.7MB   16
a77658238b91   sentry-self-hosted-snuba-subscription-consumer-metrics-1            0.39%     132.6MiB / 31.26GiB   0.41%     36.2MB / 33.2MB   9.47MB / 41.3MB   9
8c9d85e099b8   sentry-self-hosted-web-1                                            0.12%     734.6MiB / 31.26GiB   2.29%     90.4MB / 332MB    346MB / 194MB     41
74ad5875b983   sentry-self-hosted-monitors-clock-tasks-1                           0.25%     147.3MiB / 31.26GiB   0.46%     35.3MB / 31.9MB   16.9MB / 49MB     6
12994ef232b6   sentry-self-hosted-billing-metrics-consumer-1                       0.46%     157.6MiB / 31.26GiB   0.49%     63.6MB / 37.2MB   13.5MB / 36MB     9
f7f6a96c18ec   sentry-self-hosted-ingest-replay-recordings-1                       0.43%     159.8MiB / 31.26GiB   0.50%     36.2MB / 32.7MB   21.8MB / 35.5MB   13
c0d724db862f   sentry-self-hosted-snuba-issue-occurrence-consumer-1                0.55%     333.4MiB / 31.26GiB   1.04%     34.8MB / 31.3MB   25.1MB / 61.4MB   41
37fb6a1dbf5e   sentry-self-hosted-cron-1                                           0.00%     179.1MiB / 31.26GiB   0.56%     17.4MB / 146MB    28.9MB / 45.1MB   3
6c46a69fa981   sentry-self-hosted-snuba-outcomes-billing-consumer-1                0.35%     200.9MiB / 31.26GiB   0.63%     18.3MB / 16.5MB   11MB / 49.2MB     26
564d6cedc04a   sentry-self-hosted-post-process-forwarder-transactions-1            0.64%     397.1MiB / 31.26GiB   1.24%     5.76GB / 462MB    21.2MB / 104MB    23
f135db607aec   sentry-self-hosted-subscription-consumer-metrics-1                  0.28%     298.9MiB / 31.26GiB   0.93%     36.5MB / 33.3MB   19.1MB / 94MB     13
c3ebde813802   sentry-self-hosted-snuba-transactions-consumer-1                    0.47%     322.3MiB / 31.26GiB   1.01%     21.4MB / 17MB     21.4MB / 60.4MB   37
4e4475215c07   sentry-self-hosted-ingest-feedback-events-1                         0.39%     238.4MiB / 31.26GiB   0.74%     35.7MB / 32.1MB   12.8MB / 156MB    15
d266f38dc1aa   sentry-self-hosted-snuba-spans-consumer-1                           0.35%     179MiB / 31.26GiB     0.56%     142MB / 582MB     11.3MB / 101MB    26
756587f4a091   sentry-self-hosted-snuba-generic-metrics-gauges-consumer-1          0.54%     181.1MiB / 31.26GiB   0.57%     117MB / 37.3MB    3.19MB / 120MB    34
ea958e5724d3   sentry-self-hosted-snuba-profiling-profiles-consumer-1              0.34%     128.8MiB / 31.26GiB   0.40%     35MB / 31.5MB     2.27MB / 120MB    26
cc77019cca49   sentry-self-hosted-symbolicator-cleanup-1                           0.00%     4.367MiB / 31.26GiB   0.01%     234kB / 0B        35.2MB / 0B       6
8c1ca1507b75   sentry-self-hosted-snuba-profiling-functions-consumer-1             0.34%     134.2MiB / 31.26GiB   0.42%     35MB / 31.5MB     2.53MB / 116MB    26
c479e6353a01   sentry-self-hosted-snuba-generic-metrics-counters-consumer-1        0.57%     255.2MiB / 31.26GiB   0.80%     18MB / 16.3MB     10.8MB / 23.2MB   34
a64b4393273e   sentry-self-hosted-snuba-errors-consumer-1                          0.57%     233.1MiB / 31.26GiB   0.73%     360MB / 317MB     4.6MB / 57.9MB    34
475833b0af89   sentry-self-hosted-snuba-generic-metrics-sets-consumer-1            0.67%     213.9MiB / 31.26GiB   0.67%     118MB / 41.6MB    6.73MB / 95.7MB   34
8a412f926058   sentry-self-hosted-snuba-generic-metrics-distributions-consumer-1   0.56%     270.6MiB / 31.26GiB   0.85%     124MB / 559MB     4.73MB / 60.4MB   34
4be03fe74a97   sentry-self-hosted-snuba-outcomes-consumer-1                        0.33%     162.1MiB / 31.26GiB   0.51%     33.8MB / 30.4MB   2.17MB / 92.4MB   26
f9689ba0b412   sentry-self-hosted-snuba-group-attributes-consumer-1                0.47%     322MiB / 31.26GiB     1.01%     34.8MB / 31.7MB   17.9MB / 86.7MB   37
acb9904b4aa4   sentry-self-hosted-snuba-subscription-consumer-transactions-1       0.38%     123.5MiB / 31.26GiB   0.39%     42.5MB / 36.3MB   8.68MB / 41.4MB   9
c44eb35c0248   sentry-self-hosted-vroom-cleanup-1                                  0.00%     3.602MiB / 31.26GiB   0.01%     234kB / 0B        8.87MB / 0B       6
e896238b7399   sentry-self-hosted-memcached-1                                      0.03%     23.07MiB / 31.26GiB   0.07%     673MB / 1.81GB    8.15MB / 2.45MB   10
f5170f49f238   sentry-self-hosted-snuba-api-1                                      0.05%     113.7MiB / 31.26GiB   0.36%     9.83MB / 16.3MB   66.3MB / 68.9MB   5
f7f090a153a3   sentry-self-hosted-symbolicator-1                                   0.00%     35MiB / 31.26GiB      0.11%     307kB / 59.4kB    25.5MB / 142MB    38
dec490aea24f   sentry-self-hosted-smtp-1                                           0.00%     1.371MiB / 31.26GiB   0.00%     255kB / 15.7kB    28.8MB / 4.1kB    2
3e57a3611024   sentry-self-hosted-postgres-1                                       0.01%     231.6MiB / 31.26GiB   0.72%     1.53GB / 728MB    17.6GB / 13.2MB   53
c5eaa992ea9b   sentry-self-hosted-kafka-1                                          1.97%     1.257GiB / 31.26GiB   4.02%     5.5GB / 11.8GB    1.73GB / 461MB    111
4bac36975fe0   sentry-self-hosted-clickhouse-1                                     0.39%     469MiB / 31.26GiB     1.47%     1.71GB / 83.7MB   3.21GB / 93.3MB   481
f1f35f95d7a7   sentry-self-hosted-redis-1                                          0.17%     56.23MiB / 31.26GiB   0.18%     11.9GB / 7.22GB   763MB / 2.97MB    5
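For reference, a point-in-time snapshot like the table above can be captured with docker stats --no-stream; a format string (just a sketch, pick whichever columns you need) trims it down to the memory columns:

docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"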

docker_compose_logs.txt latest_install_logs.txt

Event ID

No response

LordSimal avatar Sep 17 '24 06:09 LordSimal

Duplicate of https://github.com/getsentry/self-hosted/issues/3327

LordSimal avatar Sep 17 '24 06:09 LordSimal

I think it's a memory leak.

barisyild avatar Sep 17 '24 10:09 barisyild

What I can confirm is that upgrading to 24.9.0 did NOT fix the issue. After a few hours of incoming events I repeatedly get the same

relay-1                                         | 2024-09-17T14:16:40.186772Z ERROR relay_server::services::health_check: Not enough memory, 32202633216 / 33568419840 (95.93% >= 95.00%)
relay-1                                         | 2024-09-17T14:16:40.186811Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed
relay-1                                         | 2024-09-17T14:16:40.449995Z ERROR relay_server::endpoints::common: error handling request error=failed to queue envelope

error in the docker compose logs.

LordSimal avatar Sep 17 '24 14:09 LordSimal

What does the event volume look like for you? Did this start happening after upgrading to 24.8.0?

hubertdeng123 avatar Sep 17 '24 23:09 hubertdeng123

We did the 24.8.0 Sentry update on the 1st of September. This is our stats page for the last 30 days:

Image

As you can see, there are sections where it works fine, but then sometimes for a few hours, sometimes even for days, no events are processed.

Image

LordSimal avatar Sep 18 '24 06:09 LordSimal

Could you track your RAM/CPU usage as well? Wondering if there is a correlation there.

hubertdeng123 avatar Sep 19 '24 23:09 hubertdeng123

I can also see errors related to https://github.com/getsentry/snuba/issues/5707 in my logs:

postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] ERROR:  duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] DETAIL:  Key (project_id, environment_id)=(76, 1) already exists.
postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] STATEMENT:  INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (76, 1, NULL) RETURNING "sentry_environmentproject"."id"
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] ERROR:  duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] DETAIL:  Key (group_id, release_id, environment)=(385, 413, production) already exists.
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] STATEMENT:  INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 385, 413, 'production', '2024-09-20T11:04:33.517322+00:00'::timestamptz, '2024-09-20T11:04:33.517322+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] ERROR:  duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] DETAIL:  Key (group_id, release_id, environment)=(6909, 413, production) already exists.
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] STATEMENT:  INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 6909, 413, 'production', '2024-09-20T11:04:33.688391+00:00'::timestamptz, '2024-09-20T11:04:33.688391+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/49/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] ERROR:  duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] DETAIL:  Key (project_id, environment_id)=(49, 14) already exists.
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] STATEMENT:  INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (49, 14, NULL) RETURNING "sentry_environmentproject"."id"
clickhouse-1                                    | 2024.09.20 11:04:40.173549 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x0000000013154417 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 4. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 8. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1                                    | 9. ? @ 0x00007f2c86de5353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:46 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:48 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
clickhouse-1                                    | 2024.09.20 11:04:48.941366 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 9. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1                                    | 10. ? @ 0x00007f2c86de5353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))

LordSimal avatar Sep 20 '24 11:09 LordSimal

I will have to get some system metrics monitoring running to give you the requested info.
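A minimal way to capture this without extra tooling would be a loop that appends host and container usage to a file (just a sketch; the log path and the 5-minute interval are placeholders):

while true; do
  date >> /root/sentry-usage.log
  free -m >> /root/sentry-usage.log
  docker stats --no-stream --format "{{.Name}}: {{.MemUsage}} ({{.MemPerc}})" >> /root/sentry-usage.log
  sleep 300
done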

LordSimal avatar Sep 20 '24 11:09 LordSimal

Hey @LordSimal can you try this:

On your relay/config.yml file (https://github.com/getsentry/self-hosted/blob/master/relay/config.example.yml) add a health section, so it'd be:

relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
logging:
  level: WARN
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
health:
  max_memory_percent: 1.0

Then run sudo docker compose up -d relay (or sudo docker compose --env-file .env.custom up -d relay). If things don't change, try restarting the relay container.
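To verify the change took effect, you can watch the relay logs and check that the 'system memory' probe no longer fails, e.g. (a sketch):

sudo docker compose logs -f relay | grep health_check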

Thanks to @Dav1dde

aldy505 avatar Sep 21 '24 00:09 aldy505

Hey @LordSimal can you try this:

On your relay/config.yml file (https://github.com/getsentry/self-hosted/blob/master/relay/config.example.yml) add a health section, so it'd be:

relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
logging:
  level: WARN
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
health:
  max_memory_percent: 1.0

Then do sudo docker compose up -d relay (or sudo docker compose --env-file .env.custom up -d relay), if things didn't change, try restarting the relay container.

Thanks to @Dav1dde

Interesting. If you don't mind, could you elaborate on the changes? Is this the same as setting resource limits on each container?

bijancot avatar Sep 21 '24 07:09 bijancot

Just wanna post my current stats of the last 3 days before I do this change

Image

The 21st is not shown in the stats here for some reason but there haven't been any events since yesterday 1PM (1 day 7 hours)

Image

I don't really understand the RAM usage graph here, since htop says there is only 13.7 GB used of 31.3 GB:

Image

But maybe NGINX Amplify uses a different RAM usage stat than htop.

Do we really have a RAM usage problem here? Is 32 GB not enough for Sentry? This worked fine with older versions on exactly this server.

LordSimal avatar Sep 21 '24 18:09 LordSimal

Adjusted the relay/config.yml and executed docker compose up -d relay to restart the relay container.

Nothing has changed so far, even though there definitely are events that should be coming in.

Should I try to just run ./install.sh again to do a "fresh restart"? I think this worked in the past.

LordSimal avatar Sep 21 '24 18:09 LordSimal

Just wanna post my current stats of the last 3 days before I do this change

Image

The 21st is not shown in the stats here for some reason but there haven't been any events since yesterday 1PM (1 day 7 hours)

Image

I don't really understand the RAM usage graph here, since htop says there is only 13.7 GB used of 31.3 GB:

Image

But maybe NGINX Amplify uses a different RAM usage stat than htop.

Do we really have a RAM usage problem here? Is 32 GB not enough for Sentry? This worked fine with older versions on exactly this server.

What is the java process?

barisyild avatar Sep 21 '24 21:09 barisyild

What is the java process?

It's the Kafka process:

root@scarecrow:~# ps -ax | grep java
   1969 pts/0    S+     0:00 grep java
  17703 ?        Ssl  183:39 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-telemetry/* kafka.Kafka /etc/kafka/kafka.properties

LordSimal avatar Sep 22 '24 08:09 LordSimal

I got some news.... I just executed

docker compose down
./install.sh
docker compose up -d

and suddenly the stats page has updated and there are events present which were not there previously...

Image

Also, events are being processed right now and my server is pinned at 100% usage

Image

Seems like something prevented the queue worker from processing the queued events.

After around ~15 minutes all queued-up events seem to have been processed and the load is normal again. Also, new Sentry events are showing up pretty much instantly in the UI, as they did in the past.

Will now wait and see if the problem re-occurs.

LordSimal avatar Sep 22 '24 09:09 LordSimal

Interesting. If you don't mind, could you elaborate on the changes? Is this the same as setting resource limits on each container?

The max_memory_percent: 1.0 effectively just disables the memory check; it would only fail if 100% of the OS memory is used, at which point the OOM killer would already have kicked in.

Created a Relay issue to investigate why Relay seems to think more memory is used than it actually is: https://github.com/getsentry/relay/issues/4059

Dav1dde avatar Sep 23 '24 06:09 Dav1dde

So far everything works fine again.

I do indeed have the relay config set as you requested and no memory errors have been reported yet by the relay container.

root@scarecrow:~/sentry# docker compose logs -n 1000 relay
relay-1  | 2024-09-22T09:08:52.572625Z ERROR relay_server::services::health_check: Health check probe 'auth' failed
relay-1  | 2024-09-22T11:17:56.187838Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), client error (Connect), operation timed out] attempts=2
relay-1  | 2024-09-22T11:18:02.993599Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-09-22T11:18:02.993674Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=3
relay-1  | 2024-09-22T11:33:29.321590Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-09-22T11:33:34.361859Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1  | 2024-09-22T15:43:05.212806Z ERROR relay_server::services::project_cache: failed to fetch project from Redis error=failed to talk to redis error.sources=[failed to communicate with redis, Resource temporarily unavailable (os error 11)]
relay-1  | 2024-09-23T00:00:42.685558Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-09-23T00:00:47.760235Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1  | 2024-09-23T00:00:53.778282Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1.5s
relay-1  | 2024-09-23T00:01:26.554169Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-09-23T00:01:41.630586Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1  | 2024-09-23T00:49:21.795499Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-09-23T00:49:26.810505Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s

LordSimal avatar Sep 23 '24 06:09 LordSimal

Great, thank you for the explanation.

bijancot avatar Sep 23 '24 06:09 bijancot

So it's been nearly 24h of uninterrupted, working Sentry... BUT it happened again. No events have been processed for the last 10 hours.

Image

As you can see this is sorted by Last seen

Here is my server stats report from the last 24h

Image

And here is the output of docker compose logs --since 12h > 12h-logs.txt

12h-logs.txt.gz

At the end of that file there is just a bunch of

clickhouse-1                                    | 2024.09.23 11:24:50.987265 [ 65743 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 9. ? @ 0x00007f10b149a609 in ?
clickhouse-1                                    | 10. ? @ 0x00007f10b13bf353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))

LordSimal avatar Sep 23 '24 17:09 LordSimal

I have a theory that the java process may not be calling garbage collection, so RAM stays high, which may indirectly prevent Sentry from allocating the memory it wants.

barisyild avatar Sep 23 '24 17:09 barisyild

Well, something inside Sentry changed with 24.8.0 to cause this. Can I downgrade Sentry to 24.7.1 to test this?

LordSimal avatar Sep 23 '24 18:09 LordSimal

So it's been nearly 24h of uninterrupted, working Sentry... BUT it happened again. No events have been processed for the last 10 hours.

Image

As you can see this is sorted by Last seen

Here is my server stats report from the last 24h

Image

@LordSimal By any chance, do you have any swapfile configured? If you do, how many GBs are allocated for swap? I can only see 32 GB is allocated for regular RAM.

And here is the output of docker compose logs --since 12h > 12h-logs.txt

12h-logs.txt.gz

At the end of that file there is just a bunch of

clickhouse-1                                    | 2024.09.23 11:24:50.987265 [ 65743 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 9. ? @ 0x00007f10b149a609 in ?
clickhouse-1                                    | 10. ? @ 0x00007f10b13bf353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))

The ClickHouse logs are not an issue; they're just saying that the connection was closed prematurely by the client. Nothing harmful about that (though if you do centralized logging and ship every syslog line to another server, yes, you'll have a disk space problem on your centralized logging server). The issue tracking this is here: https://github.com/getsentry/snuba/issues/5707

Although from your logs, I'm seeing something weird related to your Kafka.

kafka-1                                         | [2024-09-23 07:42:35,763] WARN [GroupCoordinator 1001]: Failed to write empty metadata for group snuba-spans-consumers: This is not the correct coordinator. (kafka.coordinator.group.GroupCoordinator)
kafka-1                                         | [2024-09-23 07:42:35,763] WARN [GroupCoordinator 1001]: Failed to write empty metadata for group snuba-consumers: This is not the correct coordinator. (kafka.coordinator.group.GroupCoordinator)
events-consumer-1                               |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
events-consumer-1                               |     return ctx.invoke(self.callback, **ctx.params)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1                               |     return __callback(*args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
events-consumer-1                               |     return f(get_current_context(), *args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/usr/src/sentry/src/sentry/runner/decorators.py", line 83, in inner
events-consumer-1                               |     return ctx.invoke(f, *args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1                               |     return __callback(*args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
events-consumer-1                               |     return f(get_current_context(), *args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/usr/src/sentry/src/sentry/runner/decorators.py", line 35, in inner
events-consumer-1                               |     return ctx.invoke(f, *args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
events-consumer-1                               |     return __callback(*args, **kwargs)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/usr/src/sentry/src/sentry/runner/commands/run.py", line 386, in basic_consumer
events-consumer-1                               |     run_processor_with_signals(processor, consumer_name)
events-consumer-1                               |   File "/usr/src/sentry/src/sentry/utils/kafka.py", line 46, in run_processor_with_signals
events-consumer-1                               |     processor.run()
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 322, in run
events-consumer-1                               |     self._run_once()
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 410, in _run_once
events-consumer-1                               |     self.__processing_strategy.submit(message)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 82, in submit
events-consumer-1                               |     self.__inner_strategy.submit(message)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/run_task.py", line 52, in submit
events-consumer-1                               |     self.__next_step.submit(Message(value))
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 34, in submit
events-consumer-1                               |     self.__next_step.submit(message)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 82, in submit
events-consumer-1                               |     self.__inner_strategy.submit(message)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/run_task.py", line 52, in submit
events-consumer-1                               |     self.__next_step.submit(Message(value))
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/guard.py", line 34, in submit
events-consumer-1                               |     self.__next_step.submit(message)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/strategies/commit.py", line 34, in submit
events-consumer-1                               |     self.__commit(message.committable)
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/processing/processor.py", line 308, in __commit
events-consumer-1                               |     self.__consumer.commit_offsets()
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/backends/kafka/consumer.py", line 609, in commit_offsets
events-consumer-1                               |     return self.__commit_retry_policy.call(self.__commit)
events-consumer-1                               |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/utils/retries.py", line 88, in call
events-consumer-1                               |     return callable()
events-consumer-1                               |            ^^^^^^^^^^
events-consumer-1                               |   File "/.venv/lib/python3.12/site-packages/arroyo/backends/kafka/consumer.py", line 567, in __commit
events-consumer-1                               |     result = self.__consumer.commit(
events-consumer-1                               |              ^^^^^^^^^^^^^^^^^^^^^^^
events-consumer-1                               | cimpl.KafkaException: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="Commit failed: Broker: Coordinator load in progress"}

The events-consumer container is a Sentry consumer written in Python, using Django. But judging from the logs, the problem may lie in the internal connection to your Kafka. Perhaps you can increase the nofile limits here: https://github.com/getsentry/self-hosted/blob/5bd6cd3710cc214b2d68858d24a7d6bcf8149d73/docker-compose.yml#L158-L161
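For reference, that ulimits override for the Kafka service in docker-compose.yml looks roughly like this (a sketch; 8192 is just an example value, pick what fits your host):

    ulimits:
      nofile:
        soft: 8192
        hard: 8192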

Or... you can migrate your Kafka to Redpanda; it's a drop-in replacement, and all you'll need to do is re-run ./install.sh.

aldy505 avatar Sep 24 '24 00:09 aldy505

Or... you can migrate your Kafka to Redpanda; it's a drop-in replacement, and all you'll need to do is re-run ./install.sh.

Is it canon (officially supported)? Are there any env variables we should change to implement it?

Regards, Baskoro

bijancot avatar Sep 24 '24 04:09 bijancot

I have adjusted the soft and hard ulimit values to 8192 and executed ./install.sh

Again, the queue worker goes ham and processes everything that has been queued up:

Image

We will see what it brings in the next 24h.

LordSimal avatar Sep 24 '24 06:09 LordSimal

By any chance, do you have any swapfile configured? If you do, how many GBs are allocated for swap? I can only see 32 GB is allocated for regular RAM.

Yes, we do have lots of swap available (64 GB), as you can see in my htop screenshots above.

LordSimal avatar Sep 24 '24 06:09 LordSimal

Let me know if I should create a new issue, but I have almost the same situation.

Self-Hosted Version 24.8.0

CPU Architecture x86_64

Docker Version 26.0.0

Docker Compose Version 2.25.0

For me, Sentry stopped processing issues with version 24.5.0 a couple of months ago. When I restart it, it works for a couple of days and then stops processing again. Here is an example from last Saturday, when it failed during the weekend.

Image

In docker compose logs, I can see lines like these:

relay-1 | 2024-09-23T15:24:03.996749240Z 2024-09-23T15:24:03.996335Z ERROR relay_server::services::health_check: Not enough memory, 16059187200 / 16773009408 (95.74% >= 95.00%)
relay-1 | 2024-09-23T15:24:06.999942277Z 2024-09-23T15:24:06.997660Z ERROR relay_server::services::health_check: Not enough memory, 15983144960 / 16773009408 (95.29% >= 95.00%)
relay-1 | 2024-09-23T15:24:58.060185500Z 2024-09-23T15:24:58.059995Z ERROR relay_server::services::health_check: Not enough memory, 16002109440 / 16773009408 (95.40% >= 95.00%)

But when I check the memory with htop, there is about 8/16 GB of RAM used. I also have 16 GB of swap and its usage varies between 0 and 10 GB.
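A rough cross-check against Relay's numbers (assuming Relay reads host-level memory info, which the totals in the log lines suggest) is to compare the kernel's used vs. available figures:

free -m
grep -E 'MemTotal|MemAvailable|Cached' /proc/meminfo

If 'available' stays large while 'used' plus cache approaches the total, the probe may be counting cache/buffers as used memory, i.e. the kind of discrepancy getsentry/relay#4059 was opened to investigate.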

Yesterday I got it processing again by restarting the whole server. After that I upgraded 24.5.0 -> 24.8.0.

For the past couple of months we have been monitoring Kafka, as there seems to be lag every time we have problems with processing. The one-liner used for monitoring:

docker exec sentry-self-hosted-kafka-1 kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group post-process-forwarder | awk -v ts=$(date +%s) 'NR > 1 {print $2 "," $6 "," ts}' | grep -v -e "TOPIC\|generic-events" >> /tmp/kafka_lag_report.csv
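The same tool can also be used to spot-check lag interactively across every consumer group at once (assuming the bundled Kafka is new enough to support --all-groups); the LAG column shows how far behind each partition is:

docker exec sentry-self-hosted-kafka-1 kafka-consumer-groups --bootstrap-server localhost:9092 --describe --all-groups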

I'm not sure if it's related, but in the postgresql container logs I see these:

worker-1  | 2024-09-23T11:55:05.254675423Z sentry.models.environment.Environment.MultipleObjectsReturned: get() returned more than one Environment -- it returned 2!

And the worker logs are full of these:

worker-1                | 00:38:17 [ERROR] celery.app.trace: Task sentry.tasks.store.save_event_transaction[865f23da-0dfe-40c7-b360-63152829cf95] raised unexpected: MultipleObjectsReturned('get() returned more than one Environment -- it returned 2!') (data={'hostname': 'celery@5aab0274fd34', 'id': '865f23da-0dfe-40c7-b360-63152829cf95', 'name': 'sentry.tasks.store.save_event_transaction', 'exc': "MultipleObjectsReturned('get() returned more than one Environment -- it returned 2!')", 'traceback': 'Traceback (most recent call last):\n  File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 477, in trace_task\n    R = retval = fun(*args, **kwargs)\n                 ^^^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/utils.py", line 1720, in runner\n    return sentry_patched_function(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery/__init__.py", line 406, in _inner\n    reraise(*exc_info)\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/utils.py", line 1649, in reraise\n    raise value\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/celery/__init__.py", line 401, in _inner\n    return f(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/celery/app/trace.py", line 760, in __protected_call__\n    return self.run(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n    return original_method(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/tasks/base.py", line 128, in _wrapped\n    result = func(*args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/tasks/store.py", line 678, in save_event_transaction\n    _do_save_event(cache_key, data, start_time, event_id, project_id, **kwargs)\n  File "/usr/src/sentry/src/sentry/tasks/store.py", line 554, in _do_save_event\n    manager.save(\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/event_manager.py", line 502, in save\n    jobs = save_transaction_events([job], projects)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/event_manager.py", line 3065, in save_transaction_events\n    _get_or_create_environment_many(jobs, projects)\n  File "/.venv/lib/python3.11/site-packages/sentry_sdk/tracing_utils.py", line 679, in func_with_tracing\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/event_manager.py", line 977, in _get_or_create_environment_many\n    job["environment"] = Environment.get_or_create(\n                         ^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/models/environment.py", line 98, in get_or_create\n    env = cls.objects.get_or_create(name=name, organization_id=project.organization_id)[\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n    return original_method(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File 
"/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 87, in manager_method\n    return getattr(self.get_queryset(), name)(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/usr/src/sentry/src/sentry/silo/base.py", line 148, in override\n    return original_method(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 948, in get_or_create\n    return self.get(**kwargs), False\n           ^^^^^^^^^^^^^^^^^^\n  File "/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 652, in get\n    raise self.model.MultipleObjectsReturned(\nsentry.models.environment.Environment.MultipleObjectsReturned: get() returned more than one Environment -- it returned 2!\n', 'args': '()', 'kwargs': "{'cache_key': 'e:a8db55e3cc9149168cc68a3b81ab5c44:40', 'data': None, 'start_time': 1727138295.0, 'event_id': 'a8db55e3cc9149168cc68a3b81ab5c44', 'project_id': 40, '__start_time': 1727138297.471802}", 'description': 'raised unexpected', 'internal': False})

I listed duplicated environments with this script and there were many of them. I couldn't merge them with the other script, which just gives me this error:

raise TransactionMissingDBException("'using' must be specified when creating a transaction")
sentry.silo.patches.silo_aware_transaction_patch.TransactionMissingDBException: 'using' must be specified when creating a transaction

I haven't yet changed max_memory_percent or ulimit. Is there anything else I could try or shall I just repeat the same steps as LordSimal?

Tha-Fox avatar Sep 24 '24 13:09 Tha-Fox

Adjusting the ulimits only seems to have increased the duration of working Sentry from 1 day to 2 days... It broke again today at 02:00 CEST.

Here are my logs from the last 12 hours (it happened 7h 40min ago): 12h_logs.txt.gz

I will execute ./install.sh to get it working again.

LordSimal avatar Sep 26 '24 07:09 LordSimal

I still think the problem is the java process. Can you restart the java process when Sentry breaks?

barisyild avatar Sep 26 '24 07:09 barisyild

I still think the problem is the java process. Can you restart the java process when Sentry breaks?

So I should only restart the Kafka container, is what you're saying?
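(i.e. something like docker compose restart kafka?)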

LordSimal avatar Sep 26 '24 07:09 LordSimal

Yeah

barisyild avatar Sep 26 '24 07:09 barisyild