on self-hosted new session recordings are dropped due to high_water_mark_partition
Bug description
This is related to https://github.com/PostHog/posthog/issues/21391, which was closed but not resolved.
The main symptom is that, on self-hosted, session recordings stop being collected very soon after the initial PostHog install. While there were a number of misconfiguration problems related to blob storage (solved in https://github.com/PostHog/posthog/pull/22268), new recordings still stopped coming in after a certain period.
Additional context
A brief look at the code suggested this might be related to: https://github.com/PostHog/posthog/blob/06297ca07681858905afca3671bf96f76318fb9b/plugin-server/src/main/ingestion-queues/session-recording/session-recordings-consumer.ts#L296
After I enabled debug mode, this turned out to be the case. Instance logs:
posthog-plugins-1 | highWaterMarks 1538193
posthog-plugins-1 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","size":1,"partitionsInBatch":[0],"assignedPartitions":[0],"msg":"[MAIN] blob_ingester_consumer - handling batch"}
posthog-plugins-1 | highWaterMarks 1538193
posthog-plugins-1 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","partition":0,"topic":"session_recording_snapshot_item_events","rawSize":22145,"lowOffset":1399832,"highOffset":1399832,"timestamp":1715759467472,"consoleLogIngestionEnabled":true,"msg":"[MAIN] [blob_ingester_consumer] - [PARTITION DEBUG] - consuming event"}
posthog-plugins-1 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","partition":0,"topic":"session_recording_snapshot_item_events","rawSize":22145,"lowOffset":1399832,"highOffset":1399832,"timestamp":1715759467472,"consoleLogIngestionEnabled":true,"dropCause":"high_water_mark_partition","msg":"[MAIN] [blob_ingester_consumer] - [PARTITION DEBUG] - dropping event"}
posthog-plugins-1 | highWaterMarks 1538193
In other words, some condition results in a very high highWaterMarks value, which prevents new recordings until message offsets reach that value. I observed session recordings sporadically resume collecting, only to stop again later.
If I remove the message-dropping condition, recordings start flowing in again.
I could not fully grasp the watermark-setting logic, but clearly this should not be happening. Kafka runs fine, and MinIO is available and stores/serves past recordings without issues.
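For context, the dropping condition boils down to roughly the following. This is a simplified sketch of how I read the consumer code, not the exact implementation, and the names are approximations:

// Simplified sketch (names are approximate, not the exact PostHog code).
// The blob ingester keeps a per-partition "high water mark" offset; any
// incoming message whose offset is at or below that mark is treated as
// already processed and dropped with dropCause "high_water_mark_partition".
type PartitionWaterMarks = Record<number, number> // partition -> stored offset

function shouldDrop(
    highWaterMarks: PartitionWaterMarks,
    message: { partition: number; offset: number }
): boolean {
    const mark = highWaterMarks[message.partition] ?? -1
    // If the stored mark is far ahead of the real Kafka offsets (e.g. offsets
    // were reset but the stored value was not), everything is dropped until
    // the offsets catch up again - which matches the logs above.
    return message.offset <= mark
}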
This was occurring on all recent installs (going back at least two weeks), i.e. the bug is consistent and experienced by many users.
Not sure how helpful, but here is a screenshot of Kafka's status page:
Debug info
No response
Thinking out loud, could it be that the misconfiguration ~~we~~ you fixed meant that the blob ingester has been spinning with errors and so (somehow) ended up with a much higher committed offset than it should have?
We do de-duplication on playback, so you should be able to either delete the offset in Redis or set it to some sensible value.
What we should observe then is that ingestion proceeds correctly and that the high-water mark protects against re-ingestion caused by rebalances.
Can you test that?
If that isn't the case, then there's definitely something else up here... my hope is that the installations have got into an unexpected state and "normal operations" are then unhelpful.
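A rough sketch of what deleting the stored watermark from Redis could look like. The key pattern below is an assumption, not the actual key the blob ingester uses, so inspect your Redis first and adapt:

// Hypothetical helper: clear session-replay high-water-mark entries from Redis
// so ingestion can resume. The key pattern is a guess - check your own
// instance before deleting anything.
import Redis from 'ioredis'

async function resetReplayWaterMarks(redisUrl: string, keyPattern: string) {
    const redis = new Redis(redisUrl)
    try {
        // On a small self-hosted instance KEYS is fine; use SCAN on larger ones.
        const keys = await redis.keys(keyPattern)
        for (const key of keys) {
            console.log('deleting', key)
            await redis.del(key)
        }
    } finally {
        redis.disconnect()
    }
}

// Example invocation (pattern is a guess, adjust after inspecting Redis):
// resetReplayWaterMarks('redis://localhost:6379', '*high-water-mark*')

After clearing (or lowering) the stored value, new recordings should start being ingested again on the next batch.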
So, what I did today:
- waited for the offset to reach the watermark value, to confirm that session recordings start flowing in;
- waited several hours and updated (rebuilt) the posthog container;
- the update reset the offset to 0 but kept the watermark at its previous value, i.e. recordings are now dropped again until the offset reaches the pre-update watermark value:
posthog-plugins-1 | { offset: 19565 } watermark 2163822
posthog-plugins-1 | { offset: 19565 } watermark 2168982
posthog-plugins-1 | { offset: 19565 } watermark 2168982
posthog-plugins-1 | {"level":"info","time":1715836116448,"pid":1900,"hostname":"3b0dc39c06d1","size":1,"partitionsInBatch":[0],"assignedPartitions":[0],"msg":"[MAIN] blob_ingester_consumer - handling batch"}
posthog-plugins-1 | { offset: 19566 } watermark 2163822
posthog-plugins-1 | { offset: 19566 } watermark 2168982
posthog-plugins-1 | { offset: 19566 } watermark 2168982
My question would be: is this expected behaviour, requiring some migration step to reset the watermark value, or is there a bug in the code that fails to reset the watermark when needed?
Containers restarted upon the update:
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
293a2541b3ba   caddy:2.6.1   "caddy run --config "   9 minutes ago   Up 9 minutes   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 443/udp, 2019/tcp   posthog-caddy-1
3e1af0f3a82d   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock "   9 minutes ago   Up 9 minutes   80/tcp, 8000-8001/tcp   posthog-web-1
66652e51e861   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock "   9 minutes ago   Up 9 minutes   80/tcp, 8000-8001/tcp   posthog-temporal-django-worker-1
3b0dc39c06d1   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock "   9 minutes ago   Up 9 minutes   80/tcp, 8000-8001/tcp   posthog-plugins-1
7f6d7521b807   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock "   9 minutes ago   Up 9 minutes   80/tcp, 8000-8001/tcp   posthog-worker-1
c5a63269be51   clickhouse/clickhouse-server:23.12.5.81-alpine   "/entrypoint.sh"   39 minutes ago   Up 38 minutes   8123/tcp, 9000/tcp, 9009/tcp   posthog-clickhouse-1
6090278ec56c   temporalio/admin-tools:1.20.0   "tail -f /dev/null"   39 minutes ago   Up 38 minutes   posthog-temporal-admin-tools-1
a90db85272a0   ghcr.io/posthog/kafka-container:v2.8.2   "/opt/bitnami/script "   39 minutes ago   Up 38 minutes   9092/tcp   posthog-kafka-1
520f762e9e20   temporalio/ui:2.10.3   "./start-ui-server.sh"   39 minutes ago   Up 38 minutes   0.0.0.0:8081->8080/tcp, :::8081->8080/tcp   posthog-temporal-ui-1
21cc4d453198   temporalio/auto-setup:1.20.0   "/etc/temporal/entry "   39 minutes ago   Up 38 minutes   6933-6935/tcp, 6939/tcp, 7234-7235/tcp, 7239/tcp, 0.0.0.0:7233->7233/tcp, :::7233->7233/tcp   posthog-temporal-1
203bdfd8a63b   provectuslabs/kafka-ui:latest   "/bin/sh -c 'java -- "   39 minutes ago   Up 38 minutes   0.0.0.0:8088->8080/tcp, :::8088->8080/tcp   posthog-kafka_ui-1
0e6745748ef4   redis:6.2.7-alpine   "docker-entrypoint.s "   39 minutes ago   Up 38 minutes   6379/tcp   posthog-redis-1
d069602094a0   zookeeper:3.7.0   "/docker-entrypoint. "   39 minutes ago   Up 38 minutes   2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp   posthog-zookeeper-1
b9975b1342c8   postgres:12-alpine   "docker-entrypoint.s "   39 minutes ago   Up 38 minutes (healthy)   5432/tcp   posthog-db-1
d32951b1166f   minio/minio:RELEASE.2024-05-10T01-41-38Z.fips   "sh -c 'mkdir -p /da "   39 minutes ago   Up 38 minutes   9000/tcp, 0.0.0.0:19000-19001->19000-19001/tcp, :::19000-19001->19000-19001/tcp   posthog-objectstorage-1
5edfb241970b   mher/flower:2.0.0   "celery flower"   47 hours ago   Up 47 hours   0.0.0.0:5555->5555/tcp, :::5555->5555/tcp   posthog-flower-1
57fb5fb4e064   elasticsearch:7.16.2   "/bin/tini -- /usr/l "   4 days ago   Up 4 days   9200/tcp, 9300/tcp   posthog-elasticsearch-1
> updated (rebuild) posthog container
> -> this (update) reset offset to 0
hmmm, we might be missing something here for sure. if we were to do kafka maintenance such that the offset was reset, we'd manually reset the redis watermark... (we try to avoid duplicate ingestion, but playback is relatively good at deduplicating, so it might be more complex than that in reality, but it's certainly close enough for rock and roll...)
so there's something that needs fixing or documenting here.
> updated (rebuild) posthog container
can you clarify what this step is, and why you'd do it?
(thanks again for these super clear, deep reports)
We update docker-compose.yml with a new container image hash for the PostHog images:
web:
extends:
file: docker-compose.base.yml
service: web
command: /compose/start
volumes:
- ./compose:/compose
image: posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1 <---- here
and run docker-compose -f "docker-compose.yml" up -d to pull and rebuild the updated images.
ooh and that resets the kafka offset 🤯
@frankh @fuziontech I'm well beyond my docker knowledge here... but there's a gap between how we operate session replay and what docker compose is doing for self-hosted...
is there a way to avoid the kafka offset reset here?
Nice troubleshooting! Just wanted to add my 2 cents, since we were struggling to get a stable self-hosted deployment recently. I think we got everything working smoothly now, but we did apply quite a few changes.
One of the changes we made is adding persistence to the redis container. I think that's also where the HighWaterMarker is being stored. Redis is wiped with every container recreation unless the /data directory is mounted.
Part of our docker-compose.yml (we also bind-mount to the host filesystem instead of using Docker volumes):
redis:
extends:
file: docker-compose.base.yml
service: redis
restart: always
volumes:
- /srv/posthog/redis:/data
We've also added persistence for the kafka container, although I'm not sure if it's relevant in this case:
kafka:
extends:
file: docker-compose.base.yml
service: kafka
restart: always
depends_on:
- zookeeper
volumes:
- /srv/posthog/kafka/data:/bitnami/kafka
Redis persistence is what's needed here, I think. Good catch @jgoclawski, I will try this on the next update.
I guess both Kafka and Redis need to write to disk for the watermark values to survive rebuilds... I will test and create a PR if it works.
Thanks for another fix @feedanal 😍
I'll close this (but folk can re-open if needed)