Timeouts in large ClickHouse table eap_items_1_local
Self-Hosted Version
25.9.0
CPU Architecture
x86_64
Docker Version
28.4.0
Docker Compose Version
2.39.2
Machine Specification
- [x] My system meets the minimum system requirements of Sentry
Steps to Reproduce
- Run 25.9.0
- Large ingest over multiple days
- Open "new" insights backend page.
Expected Result
/events-stats API calls return reasonably fast results.
Actual Result
Clickhouse error log:
2025.09.20 17:54:32.714831 [ 859 ] {randomized-2507e81dae5f4761806b3bfb335617f9} <Error> executeQuery: Code: 159. DB::Exception: Timeout exceeded: elapsed 25005.287 ms, maximum: 25000 ms. (TIMEOUT_EXCEEDED) (version 25.3.6.10034.altinitystable (altinity build)) (from 172.18.0.23:49096) (in query: SELECT (if(mapContains(attributes_string_19, 'sentry.op'), arrayElement(attributes_string_19, 'sentry.op'), NULL) AS `sentry.op_TYPE_STRING`), (if(mapContains(attributes_string_39, 'sentry.group'), arrayElement(attributes_string_39, 'sentry.group'), NULL) AS `sentry.group_TYPE_STRING`), (cast(project_id, 'Int64') AS `sentry.project_id_TYPE_INT`), (if(mapContains(attributes_string_36, 'sentry.normalized_description'), arrayElement(attributes_string_36, 'sentry.normalized_description'), NULL) AS `sentry.normalized_description_TYPE_STRING`), (divide(sumIfOrNull(multiply((if(mapContains(attributes_float_5, 'sentry.duration_ms'), arrayElement(attributes_float_5, 'sentry.duration_ms'), NULL) AS `sentry.duration_ms_TYPE_DOUBLE`), divide(1, sampling_factor)), mapContains(attributes_float_5, 'sentry.duration_ms') AND true), sumIfOrNull(divide(1, sampling_factor), mapContains(attributes_float_5, 'sentry.duration_ms') AND true)) AS `avg(span.duration)`), (divide(abs(minus(divide(plus((sumIfOrNull(multiply(`sentry.duration_ms_TYPE_DOUBLE`, sampling_weight), mapContains(attributes_float_5, 'sentry.duration_ms') AND true) AS `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__sum`), (multiply(2.24, sqrt(sumIf(multiply(multiply(`sentry.duration_ms_TYPE_DOUBLE`, `sentry.duration_ms_TYPE_DOUBLE`), minus(multiply(sampling_weight, sampling_weight), sampling_weight)), mapContains(attributes_float_5, 'sentry.duration_ms') AND true))) AS `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__sum_err`)), minus((sumIfOrNull(sampling_weight, mapContains(attributes_float_5, 'sentry.duration_ms') AND true) AS `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__count`), (multiply(2.24, sqrt(sumIf(minus(multiply(sampling_weight, sampling_weight), sampling_weight), mapContains(attributes_float_5, 'sentry.duration_ms') AND true))) AS `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__count_err`))), divide(minus(`__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__sum`, `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__sum_err`), plus(`__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__count`, `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg__count_err`)))), 2) AS `__snuba_custom_column__confidence_interval$avg(span.duration)$function_type:avg`), (divide(countIf(`sentry.duration_ms_TYPE_DOUBLE`, mapContains(attributes_float_5, 'sentry.duration_ms') AND true), sumIf(sampling_weight, mapContains(attributes_float_5, 'sentry.duration_ms') AND true)) AS `__snuba_custom_column__average_sample_rate$avg(span.duration)`), (countIf(`sentry.duration_ms_TYPE_DOUBLE`, mapContains(attributes_float_5, 'sentry.duration_ms') AND true) AS `__snuba_custom_column__count$avg(span.duration)`), (coalesce(if(mapContains(attributes_string_20, 'sentry.transaction'), arrayElement(attributes_string_20, 'sentry.transaction'), NULL), if(mapContains(attributes_string_3, 'transaction'), arrayElement(attributes_string_3, 'transaction'), NULL)) AS `sentry.transaction_TYPE_STRING`) FROM eap_items_1_local WHERE in(project_id, 
[2, 3, 5, 7, 8, 9, 10, 12, 13, 14, 17]) AND equals(organization_id, 1) AND less(timestamp, toDateTime(1758391201)) AND greaterOrEquals(timestamp, toDateTime(1755798301)) AND mapContains(attributes_string_6, 'db.system') AND (notEquals((if(mapContains(attributes_string_6, 'db.system'), arrayElement(attributes_string_6, 'db.system'), NULL) AS `db.system_TYPE_STRING`), '') OR xor(isNull(`db.system_TYPE_STRING`), isNull(''))) AND mapContains(attributes_string_39, 'sentry.group') AND (notEquals(`sentry.group_TYPE_STRING`, '') OR xor(isNull(`sentry.group_TYPE_STRING`), isNull(''))) AND mapContains(attributes_float_5, 'sentry.duration_ms') AND equals(item_type, 1) GROUP BY `sentry.op_TYPE_STRING`, `sentry.group_TYPE_STRING`, `sentry.project_id_TYPE_INT`, `sentry.normalized_description_TYPE_STRING`, `sentry.transaction_TYPE_STRING` ORDER BY `avg(span.duration)` DESC LIMIT 4 OFFSET 0), Stack trace (when copying this message, always include the lines below):
Web log:
web-1 | 17:53:07 [ERROR] django.request: Gateway Timeout: /api/0/organizations/last/events-stats/ (status_code=504 request=<WSGIRequest: GET '/api/0/organizations/last/events-stats/?dataset=spans&excludeOther=0&interval=1d&partial=1&per_page=50&query=span.op%3Aqueue.process&referrer=api.insights.backend.overview.jobs-chart&sampling=NORMAL&statsPeriod=30d&transformAliasToInputFormat=0&yAxis=trace_status_rate%28internal_error%29&yAxis=count%28span.duration%29'>)
Running the query by hand in clickhouse-client within the container yields:
4 rows in set. Elapsed: 18.297 sec. Processed 136.07 million rows, 64.32 GB (7.44 million rows/s., 3.52 GB/s.) Peak memory usage: 284.89 MiB.
Not the fastest disks in the world, but still a lot of processing.
Count of table:
SELECT count(*)
FROM eap_items_1_local
Query id: e1d48846-cbbb-4492-9ba7-c5dadfeca1db
┌───count()─┐
1. │ 176510259 │ -- 176.51 million
└───────────┘
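For context, here is a quick way to gauge daily volume in the same table (just a sketch; it assumes the timestamp and item_type columns used by the failing query above):

```bash
# Rough per-day item volume for the last 7 days in the EAP table
docker compose exec clickhouse clickhouse-client --query "
  SELECT toDate(timestamp) AS day, count() AS items
  FROM eap_items_1_local
  WHERE timestamp >= now() - INTERVAL 7 DAY AND item_type = 1
  GROUP BY day
  ORDER BY day"
```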
Event ID
No response
I am in the same position as you. The Insights backend is also unresponsive for me. It seems I just have a "few" more entries in ClickHouse 😅
SELECT count(*)
FROM eap_items_1_local
Query id: 4dc1eae8-969c-454b-885c-fd886d912b0e
┌────count()─┐
1. │ 1369086710 │ -- 1.37 billion
└────────────┘
I upgraded from 25.8.0 to 25.9.0
Any news on this? I've had this issue since 24.8.0, I was hoping it would be fixed in this release
> Any news on this? I've had this issue since 24.8.0, I was hoping it would be fixed in this release
Is 24.8.0 a typo? Or have you been running EAP with some other early-access feature flags?
I did for 25.8.0, and waited to raise the issue until these features hit GA, like ourlogs and the new stats UI/Taskbroker/other stuff I believed to be related.
Oh, yes, my mistake: 25.8.0.
I am assuming we are not doing a contest here but:
┌────count()─┐
1. │ 7507373159 │ -- 7.51 billion
└────────────┘
And indeed, unless I manage to set the time interval to 24 hours, the Insights pages (in my case Laravel Backend) just crash the server (16c/64GB plus 64GB swap on NVMe storage). The "good" thing is that it defaults to 14 days, so every time I need to open the page I have to switch to 24h and then wait 15-30 minutes for the server to calm down.
Curious whether there is something to fix here, or whether it's because we run ClickHouse on a single node that cannot distribute the load, and we are simply hitting limitations of the current setup that sentry.io doesn't run into because they probably have quite a few nodes in their ClickHouse cluster.
This did not seem to be new in this release though; it was already happening in 25.8.0. But I might be wrong since I'm using nightly builds most of the time, so there might be a related code change, or a missed migration that is missing an index, or whatever.
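To rule out the missed-migration theory, Snuba's migration status can be listed from the self-hosted directory (a quick check, assuming the standard snuba-api service name):

```bash
# Every migration group should show as completed; pending entries would point
# at a missed migration rather than a query-planning problem
docker compose run --rm snuba-api snuba migrations list
```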
I adjusted my sentry.conf.py to the following nodestore config:
SENTRY_NODESTORE = "sentry_nodestore_s3.S3PassthroughDjangoNodeStorage"
SENTRY_NODESTORE_OPTIONS = {
    "compression": True,
    "endpoint_url": "http://seaweedfs:8333",
    "bucket_path": "nodestore",
    "bucket_name": "nodestore",
    "region_name": "us-east-1",
    "aws_access_key_id": "sentry",
    "aws_secret_access_key": "sentry",
    "delete_through": True,  # delete through to the Django nodestore (delete object from S3 and Django)
    "write_through": False,  # write through to the Django nodestore (duplicate writes to S3 and Django)
    "read_through": True,  # read through to the Django nodestore (if object not found in S3)
}
and it made at least the issue detail page work again.
The Insights area is still broken, though.
Echoing @stayallive: I went from 25.8.0 to nightly to test a pgbouncer setup, and then actually back to 25.8.0. It MAY have been a missed migration, because I saw these timeouts in 25.8.0 as well, with some EAP feature flags enabled.
Yep, Insights is pretty broken for us as well (a couple of billion rows in that table). Is anyone aware of a setting with which we could make Insights default to 24 hours instead of 14 days? I'm assuming this must be a problem with the table definition/migrations, since this apparently works in Sentry cloud on similar volumes. We run this with 64GB of RAM, most of which is unutilized.
FYI: I have been able to mitigate this (I think) by disabling the use-eap flag for Insights:
https://github.com/getsentry/self-hosted/commit/8ad99169994e6c67f50488e27646d30c9ca80703#diff-f4bf6cddc34b3c80a2bb2497a46d5e7315424593993d0ce41ab7c48064ba165cR327
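If you want to see which EAP-related flags your own config currently enables before toggling anything (the exact flag name differs per release; see the linked commit), a quick grep over the self-hosted config files works:

```bash
# Run from the self-hosted checkout; shows EAP-related flags in your live
# config and in the shipped example for comparison
grep -n "eap" sentry/sentry.conf.py sentry/sentry.conf.example.py
```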
@aldy505 @phacops Hi! Could you please let me know when someone more experienced than us can look at this issue? We can't use Sentry because nothing is loading.
Following a debug session in Discord (#self-hosted), here are "some" of the queries that overspend resources (from the query log):
SELECT query
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = 'randomized-d5e5b196fece40c39504078ab11a4a96'
Query id: fbce3fb5-8525-4838-ba45-20abe23fbd73
┌─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ SELECT (cast(lower(leftPad(hex(item_id), if(greater(length(hex(item_id)), 16), 32, 16), '0')), 'String') AS `sentry.item_id_TYPE_STRING`), (if(mapContains(attributes_string_19, 'sentry.op'), arrayElement(attributes_string_19, 'sentry.op'), NULL) AS `sentry.op_TYPE_STRING`), (if(mapContains(attributes_string_11, 'sentry.raw_description'), arrayElement(attributes_string_11, 'sentry.raw_description'), NULL) AS `sentry.raw_description_TYPE_STRING`), (if(mapContains(attributes_float_5, 'sentry.duration_ms'), arrayElement(attributes_float_5, 'sentry.duration_ms'), NULL) AS `sentry.duration_ms_TYPE_DOUBLE`), (coalesce(if(mapContains(attributes_string_20, 'sentry.transaction'), arrayElement(attributes_string_20, 'sentry.transaction'), NULL), if(mapContains(attributes_string_3, 'transaction'), arrayElement(attributes_string_3, 'transaction'), NULL)) AS `sentry.transaction_TYPE_STRING`), (cast(timestamp, 'Float64') AS `sentry.timestamp_TYPE_DOUBLE`), (transform(CAST(ifNull((cast(project_id, 'Int64') AS `sentry.project_id_TYPE_INT`), ''), 'String'), ['13', '17', '10', '9', '7', '5', '2', '14', '8', '3', '12'], ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'], 'unknown') AS project), (cast(replaceAll(toString(trace_id), '-', ''), 'String') AS `sentry.trace_id_TYPE_STRING`), (if(mapContains(attributes_string_11, 'sentry.segment_id'), arrayElement(attributes_string_11, 'sentry.segment_id'), NULL) AS `sentry.segment_id_TYPE_STRING`), (transform(CAST(ifNull(`sentry.project_id_TYPE_INT`, ''), 'String'), ['13', '17', '10', '9', '7', '5', '2', '14', '8', '3', '12'], ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'], 'unknown') AS `project.name`) FROM eap_items_1_local WHERE in(project_id, [2, 3, 5, 7, 8, 9, 10, 12, 13, 14, 17]) AND equals(organization_id, 1) AND less(timestamp, toDateTime(1759217701)) AND greaterOrEquals(timestamp, toDateTime(1758008041)) AND true AND equals(item_type, 1) ORDER BY `sentry.timestamp_TYPE_DOUBLE` DESC LIMIT 51 OFFSET 0 │
2. │ SELECT (cast(lower(leftPad(hex(item_id), if(greater(length(hex(item_id)), 16), 32, 16), '0')), 'String') AS `sentry.item_id_TYPE_STRING`), (if(mapContains(attributes_string_19, 'sentry.op'), arrayElement(attributes_string_19, 'sentry.op'), NULL) AS `sentry.op_TYPE_STRING`), (if(mapContains(attributes_string_11, 'sentry.raw_description'), arrayElement(attributes_string_11, 'sentry.raw_description'), NULL) AS `sentry.raw_description_TYPE_STRING`), (if(mapContains(attributes_float_5, 'sentry.duration_ms'), arrayElement(attributes_float_5, 'sentry.duration_ms'), NULL) AS `sentry.duration_ms_TYPE_DOUBLE`), (coalesce(if(mapContains(attributes_string_20, 'sentry.transaction'), arrayElement(attributes_string_20, 'sentry.transaction'), NULL), if(mapContains(attributes_string_3, 'transaction'), arrayElement(attributes_string_3, 'transaction'), NULL)) AS `sentry.transaction_TYPE_STRING`), (cast(timestamp, 'Float64') AS `sentry.timestamp_TYPE_DOUBLE`), (transform(CAST(ifNull((cast(project_id, 'Int64') AS `sentry.project_id_TYPE_INT`), ''), 'String'), ['13', '17', '10', '9', '7', '5', '2', '14', '8', '3', '12'], ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'], 'unknown') AS project), (cast(replaceAll(toString(trace_id), '-', ''), 'String') AS `sentry.trace_id_TYPE_STRING`), (if(mapContains(attributes_string_11, 'sentry.segment_id'), arrayElement(attributes_string_11, 'sentry.segment_id'), NULL) AS `sentry.segment_id_TYPE_STRING`), (transform(CAST(ifNull(`sentry.project_id_TYPE_INT`, ''), 'String'), ['13', '17', '10', '9', '7', '5', '2', '14', '8', '3', '12'], ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'], 'unknown') AS `project.name`) FROM eap_items_1_local WHERE in(project_id, [2, 3, 5, 7, 8, 9, 10, 12, 13, 14, 17]) AND equals(organization_id, 1) AND less(timestamp, toDateTime(1759217701)) AND greaterOrEquals(timestamp, toDateTime(1758008041)) AND true AND equals(item_type, 1) ORDER BY `sentry.timestamp_TYPE_DOUBLE` DESC LIMIT 51 OFFSET 0 │
Using the following query_log section in config.xml:
<query_log>
    <database>system</database>
    <table>query_log</table>
    <ttl>event_date + INTERVAL 1 day</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    <max_size_rows>1048576</max_size_rows>
    <reserved_size_rows>8192</reserved_size_rows>
    <buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
    <flush_on_crash>false</flush_on_crash>
</query_log>
To debug ClickHouse queries, one has to adjust their clickhouse/config.xml and replace
<query_log remove="remove"/>
with
<query_log>
    <database>system</database>
    <table>query_log</table>
    <ttl>event_date + INTERVAL 1 day</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    <max_size_rows>1048576</max_size_rows>
    <reserved_size_rows>8192</reserved_size_rows>
    <buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
    <flush_on_crash>false</flush_on_crash>
</query_log>
then restart your Sentry stack as a whole via docker compose down && docker compose up -d --wait and wait a few minutes.
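To confirm the query log is actually being written after the restart (a quick sanity check, not an official step), you can count its rows:

```bash
# A non-zero count means ClickHouse is persisting query history to system.query_log
docker compose exec clickhouse clickhouse-client --query "SELECT count() FROM system.query_log"
```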
After that you need to open a ClickHouse CLI via
docker compose exec clickhouse clickhouse-client
and execute the following query:
SELECT
type,
event_time,
initial_query_id,
formatReadableSize(memory_usage) AS memory,
`ProfileEvents.Values`[indexOf(`ProfileEvents.Names`, 'UserTimeMicroseconds')] AS userCPU,
`ProfileEvents.Values`[indexOf(`ProfileEvents.Names`, 'SystemTimeMicroseconds')] AS systemCPU,
normalizedQueryHash(query) AS normalized_query_hash
FROM system.query_log
ORDER BY memory_usage DESC
LIMIT 10
This will result in something like:
┌─type─────────────────────┬──────────event_time─┬─initial_query_id────────────────────────────┬─memory─────┬──userCPU─┬─systemCPU─┬─normalized_query_hash─┐
1. │ ExceptionWhileProcessing │ 2025-09-30 07:22:00 │ randomized-ab42c85f7f6b44faa114e43bb201d0cc │ 1.20 GiB │ 36715919 │ 4302509 │ 13521358877695970890 │
2. │ QueryFinish │ 2025-09-30 07:21:30 │ randomized-1488b44299044b69ba484f029b58bdfa │ 1.04 GiB │ 26935740 │ 3747378 │ 13521358877695970890 │
3. │ QueryFinish │ 2025-09-30 07:17:41 │ 3aafa240-b881-45d1-b766-22e1ae771af5 │ 944.99 MiB │ 359105 │ 171500 │ 7330655308746279451 │
4. │ QueryFinish │ 2025-09-30 07:19:39 │ cb070472-3fcb-4db7-bbd1-2b770e96d056 │ 929.71 MiB │ 258367 │ 93965 │ 7330655308746279451 │
5. │ QueryFinish │ 2025-09-30 07:19:50 │ 2029f173-a5ba-462a-ae9b-080367cf1fb8 │ 913.64 MiB │ 217354 │ 109247 │ 7330655308746279451 │
6. │ QueryFinish │ 2025-09-30 07:22:01 │ a2558303-bf92-4e99-aa98-0d67b93ee9e2 │ 913.15 MiB │ 379315 │ 113498 │ 7330655308746279451 │
7. │ QueryFinish │ 2025-09-30 07:20:04 │ 51b71240-52f2-4f39-9168-548dd6f74e3e │ 910.61 MiB │ 238052 │ 57025 │ 7330655308746279451 │
8. │ QueryFinish │ 2025-09-30 07:22:13 │ randomized-250237e9df7049dcb8f415d9c76db28b │ 399.61 MiB │ 80140405 │ 2365666 │ 467395772582250954 │
9. │ QueryFinish │ 2025-09-30 07:22:08 │ 97d015f5-b673-4714-a813-91073f70f83a │ 271.14 MiB │ 17948 │ 23297 │ 7949310113346595683 │
10. │ QueryFinish │ 2025-09-30 07:20:04 │ 69e89ea2-fbd5-412f-bf8e-000cfe4eacc4 │ 270.22 MiB │ 10641 │ 11810 │ 7949310113346595683 │
└──────────────────────────┴─────────────────────┴─────────────────────────────────────────────┴────────────┴──────────┴───────────┴───────────────────────┘
This doesn't return the raw SQL query; that needs to be fetched separately via
SELECT query
FROM clusterAllReplicas(default, system.query_log)
WHERE initial_query_id = '<id>'
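If you also want the duration and peak memory of that query in the same lookup, system.query_log has those columns as well (a sketch; keep the <id> from the step above):

```bash
# Pull timing, memory, and the raw SQL for the finished query in one go
docker compose exec clickhouse clickhouse-client --query "
  SELECT query_duration_ms, formatReadableSize(memory_usage) AS memory, query
  FROM system.query_log
  WHERE initial_query_id = '<id>' AND type = 'QueryFinish'
  FORMAT Vertical"
```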
For me, I got message.txt
I did another test and found another sql query, which takes up > 1GB of RAM:
6. │ ExceptionWhileProcessing │ 2025-09-30 10:32:09 │ randomized-781026ea3ec64a0b941a16f5dbe5fa53 │ 1.23 GiB │ 166710882 │ 4161280 │ 6013952681561621165 │
For those who don't join Sentry's Discord: Pierre is looking at this
Hi, sorry for being annoying, but is there any insight into how to solve this problem, and will a fix be released before the 25.10.0 release?
For me I can make the Insights Backend "work" again if I filter the view by a specific project instead of showing the data of all my projects.
Something about the query, which contains > 200 project IDs and related data, seems to break ClickHouse 😁
Still not a "true solution" of course but a workaround to use the backend again.
No, I don't think this will be fixed when 25.11.0 is released.
+1 with this issue
I have the same problem with Sentry 25.10.
Same problem on 25.9.0. Does 25.8.0 work well?
> Same problem on 25.9.0. Does 25.8.0 work well?
I believe it depends on your span/transaction data; it might work if you only have a small amount. Most people impacted by this problem have more than 5M spans per day. The current workaround is just to choose a lower time interval.
The fix should be implemented in the 25.12.0 release.
Found a fix for this. The issue is that Snuba has downsampled tables (eap_items_1_downsample_8_local, etc.) but the default threshold to use them is set to 1 billion rows, which is too high for most self-hosted instances.
You can lower this threshold via Snuba's runtime config in Redis:
docker compose exec redis redis-cli -n 1 HSET snuba-config "OutcomesBasedRoutingStrategy.max_items_before_downsampling" "100000000"
This sets the threshold to 100 million. After this change, queries started hitting the 8x downsampled table instead of the full one, and query times improved significantly.
To verify Snuba sees the config:
docker compose exec snuba-api python3 -c "from snuba import state; print(state.get_int_config('OutcomesBasedRoutingStrategy.max_items_before_downsampling'))"
Should print 100000000. If it prints None, you may need to check the correct Redis db number for your setup.
To verify queries use downsampled tables, check system.processes in ClickHouse - you should see queries going to eap_items_1_downsample_8_local instead of eap_items_1_local.
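For example, something like this while an Insights page is loading (just a sketch; the table name embedded in the query text tells you which variant is being hit):

```bash
# Live queries touching the EAP tables; downsampled variants show up as
# eap_items_1_downsample_* in the query text
docker compose exec clickhouse clickhouse-client --query "
  SELECT query_id, elapsed, read_rows, substring(query, 1, 120) AS query_start
  FROM system.processes
  WHERE query LIKE '%eap_items_1%'"
```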
To revert:
docker compose exec redis redis-cli -n 1 HDEL snuba-config "OutcomesBasedRoutingStrategy.max_items_before_downsampling"
This only affects query routing (reads), not data ingestion. Config persists in Redis across container restarts.
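To see everything currently set in that runtime config hash (useful for double-checking the key name and value):

```bash
# Dump Snuba's runtime config hash from Redis db 1
docker compose exec redis redis-cli -n 1 HGETALL snuba-config
```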
@nomarek Thank you so much, this really works and I can get 90 days of data without any errors! Hooray!
Thank you; I gave this a go, and it seems to improve the performance, but some of the graphs still don't load
I opted to go a little lower, "10000000", which seems to allow 90-day graphs to also load in a reasonable time now. You can play around with the threshold for your own dataset/server. But awesome that we found this; now to find a nice way to make this configurable via the .env or something, because this is a great fix.
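Until it is a first-class setting, a small wrapper could read the threshold from an environment variable; note that EAP_DOWNSAMPLE_THRESHOLD is just an assumed name here, not something self-hosted reads on its own:

```bash
# Hypothetical helper: apply the threshold from an env var, defaulting to 100M
docker compose exec redis redis-cli -n 1 HSET snuba-config \
  "OutcomesBasedRoutingStrategy.max_items_before_downsampling" \
  "${EAP_DOWNSAMPLE_THRESHOLD:-100000000}"
```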
> Found a fix for this. The issue is that Snuba has downsampled tables (eap_items_1_downsample_8_local, etc.) but the default threshold to use them is set to 1 billion rows, which is too high for most self-hosted instances.
Nice find. The craziest thing about this is that those dashboard views are running those heavy aggregation queries on-demand, and not on an MV or some other pre-computed aggregation.
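Purely to illustrate the idea (this is not something Sentry/Snuba ships or supports, and the column names are simply taken from the failing query above), a daily pre-aggregation of span durations could look roughly like this:

```bash
# Illustrative only: a SummingMergeTree MV that pre-aggregates span duration per
# org/project/day; avg(span.duration) would then be duration_ms_sum / duration_count
docker compose exec clickhouse clickhouse-client --query "
  CREATE MATERIALIZED VIEW IF NOT EXISTS eap_span_duration_daily_mv
  ENGINE = SummingMergeTree
  ORDER BY (organization_id, project_id, day)
  AS SELECT
      organization_id,
      project_id,
      toDate(timestamp) AS day,
      countIf(mapContains(attributes_float_5, 'sentry.duration_ms')) AS duration_count,
      sumIf(arrayElement(attributes_float_5, 'sentry.duration_ms'),
            mapContains(attributes_float_5, 'sentry.duration_ms')) AS duration_ms_sum
  FROM eap_items_1_local
  WHERE item_type = 1
  GROUP BY organization_id, project_id, day"
```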
I think the latest release solved it for us (25.11.1)