
on self-hosted new session recordings are dropped due to high_water_mark_partition

feedanal opened this issue 9 months ago • 9 comments

Bug description

This is related to https://github.com/PostHog/posthog/issues/21391 which was closed, but not resolved.

The main symptom is that, on self-hosted, session recordings stop being collected very soon after the initial PostHog install. While there were a number of misconfiguration problems related to blob storage (solved in https://github.com/PostHog/posthog/pull/22268), new recordings still stopped coming in after a certain period.

Additional context

Brief look inside the code gave me an idea this might be related to: https://github.com/PostHog/posthog/blob/06297ca07681858905afca3671bf96f76318fb9b/plugin-server/src/main/ingestion-queues/session-recording/session-recordings-consumer.ts#L296

After I enabled debug mode, this turned out to be the case. Instance logs:

posthog-plugins-1                 | highWaterMarks 1538193
posthog-plugins-1                 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","size":1,"partitionsInBatch":[0],"assignedPartitions":[0],"msg":"[MAIN]  blob_ingester_consumer - handling batch"}
posthog-plugins-1                 | highWaterMarks 1538193
posthog-plugins-1                 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","partition":0,"topic":"session_recording_snapshot_item_events","rawSize":22145,"lowOffset":1399832,"highOffset":1399832,"timestamp":1715759467472,"consoleLogIngestionEnabled":true,"msg":"[MAIN]  [blob_ingester_consumer] - [PARTITION DEBUG] - consuming event"}
posthog-plugins-1                 | {"level":"info","time":1715759467530,"pid":44880,"hostname":"08c62c65cb24","partition":0,"topic":"session_recording_snapshot_item_events","rawSize":22145,"lowOffset":1399832,"highOffset":1399832,"timestamp":1715759467472,"consoleLogIngestionEnabled":true,"dropCause":"high_water_mark_partition","msg":"[MAIN]  [blob_ingester_consumer] - [PARTITION DEBUG] - dropping event"}
posthog-plugins-1                 | highWaterMarks 1538193

In other words, some condition results in a very high highWaterMarks value, which prevents new recordings from being ingested until message offsets catch up to it. I observed session recordings sporadically resume collecting, only to stop again later.

If I remove the message-dropping condition, recordings start flowing in again.

I could not fully grasp the watermark-setting logic, but clearly this should not be happening. Kafka runs fine, and MinIO is available and stores/serves past recordings without issues.
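For reference, the guard in the linked consumer boils down to a per-partition check like the sketch below. This is a simplified TypeScript sketch, not the actual plugin-server code: the class and method names are illustrative, and in the real consumer the marks are persisted outside the process (in Redis, per the discussion further down), which is why they can outlive a Kafka offset reset.

```ts
// Simplified sketch of the per-partition drop guard (illustrative names,
// not the real plugin-server identifiers).
type IncomingMessage = { partition: number; offset: number }

class PartitionHighWaterMarker {
    // In the real consumer these marks are persisted (in Redis, per the
    // discussion below), so they can outlive a Kafka offset reset.
    private marks = new Map<number, number>()

    getWaterMark(partition: number): number {
        return this.marks.get(partition) ?? -1
    }

    setWaterMark(partition: number, offset: number): void {
        // Only ever moves forward, which is why it never "catches" a
        // Kafka offset that has been reset to 0.
        this.marks.set(partition, Math.max(this.getWaterMark(partition), offset))
    }

    isBelowHighWaterMark(message: IncomingMessage): boolean {
        return message.offset <= this.getWaterMark(message.partition)
    }
}

// Messages at or below the stored mark are dropped with
// dropCause "high_water_mark_partition" instead of being ingested.
function shouldDropMessage(marker: PartitionHighWaterMarker, message: IncomingMessage): boolean {
    return marker.isBelowHighWaterMark(message)
}
```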

This has occurred on all recent installs (going back at least two weeks), i.e. the bug is consistent and experienced by many users.

Not sure how helpful, but Kafka's status page:

[screenshot: Kafka status page]

Debug info

[screenshot: debug info]


feedanal · May 15 '24 08:05

Thinking out loud, could it be that the misconfiguration ~~we~~ you fixed meant that the blob ingester has been spinning with errors and so (somehow) ended up with a much higher committed offset than it should have?

We do de-duplication on playback so you should be able to either delete the offset in redis or set it to some sensible value

What we should observe then is that ingestion proceeds correctly and the high-water mark protects against reingestion caused by rebalances

Can you test that?

If that isn't the case then there's definitely something else up here... my hope is the installations have got into an unexpected state and "normal operations" are then unhelpful
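Concretely, inspecting and clearing the stored marks could look something like the sketch below (TypeScript with ioredis). The connection URL and key pattern are assumptions, not a documented location of the replay high-water marks; list and inspect the matched keys first, and run this from somewhere that can reach the compose stack's Redis.

```ts
import Redis from "ioredis"

// Sketch only: the connection URL and key pattern are assumptions, not the
// documented location of the replay high-water marks. Inspect the matches
// first and delete only once you are sure they are the watermark entries.
async function clearReplayWatermarks(pattern = "*high*water*mark*"): Promise<void> {
    const redis = new Redis("redis://127.0.0.1:6379") // adjust to your stack's Redis

    const keys = await redis.keys(pattern) // acceptable on a small self-hosted instance
    console.log("matched keys:", keys)

    for (const key of keys) {
        console.log(key, "=>", await redis.type(key))
        await redis.del(key) // removes the stored mark so ingestion can resume
    }

    await redis.quit()
}

clearReplayWatermarks().catch((err) => {
    console.error(err)
    process.exit(1)
})
```

Setting the key to "some sensible value" instead of deleting it would depend on the key's actual data type, which is why the sketch only deletes.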

pauldambra · May 15 '24 08:05

So, what I did today:

  • waited for the offset to reach the watermark value, confirming that session recordings start flowing in again;
  • waited several hours, then updated (rebuilt) the posthog container;
  • the update reset the offset to 0 but kept the watermark at its previous value, i.e. recordings are now dropped again until the offset reaches the pre-update watermark value:
IN]  blob_ingester_consumer - handling batch"}
posthog-plugins-1                 | { offset: 19565 } watermark 2163822
posthog-plugins-1                 | { offset: 19565 } watermark 2168982
posthog-plugins-1                 | { offset: 19565 } watermark 2168982
posthog-plugins-1                 | {"level":"info","time":1715836116448,"pid":1900,"hostname":"3b0dc39c06d1","size":1,"partitionsInBatch":[0],"assignedPartitions":[0],"msg":"[MAIN]  blob_ingester_consumer - handling batch"}
posthog-plugins-1                 | { offset: 19566 } watermark 2163822
posthog-plugins-1                 | { offset: 19566 } watermark 2168982
posthog-plugins-1                 | { offset: 19566 } watermark 2168982
^Ccanceled

My question would be: is this expected behaviour that requires some migration step to reset the watermark, or is there a bug in the code that fails to reset the watermark when needed?

feedanal · May 16 '24 05:05

Restarted containers after the update:

CONTAINER ID   IMAGE                                                      COMMAND                  CREATED          STATUS                    PORTS                                                                                          NAMES
293a2541b3ba   caddy:2.6.1                                                "caddy run --config …"   9 minutes ago    Up 9 minutes              0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 443/udp, 2019/tcp   posthog-caddy-1
3e1af0f3a82d   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock…"   9 minutes ago    Up 9 minutes              80/tcp, 8000-8001/tcp                                                                          posthog-web-1
66652e51e861   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock…"   9 minutes ago    Up 9 minutes              80/tcp, 8000-8001/tcp                                                                          posthog-temporal-django-worker-1
3b0dc39c06d1   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock…"   9 minutes ago    Up 9 minutes              80/tcp, 8000-8001/tcp                                                                          posthog-plugins-1
7f6d7521b807   posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1   "/usr/local/bin/dock…"   9 minutes ago    Up 9 minutes              80/tcp, 8000-8001/tcp                                                                          posthog-worker-1
c5a63269be51   clickhouse/clickhouse-server:23.12.5.81-alpine             "/entrypoint.sh"         39 minutes ago   Up 38 minutes             8123/tcp, 9000/tcp, 9009/tcp                                                                   posthog-clickhouse-1
6090278ec56c   temporalio/admin-tools:1.20.0                              "tail -f /dev/null"      39 minutes ago   Up 38 minutes                                                                                                            posthog-temporal-admin-tools-1
a90db85272a0   ghcr.io/posthog/kafka-container:v2.8.2                     "/opt/bitnami/script…"   39 minutes ago   Up 38 minutes             9092/tcp                                                                                       posthog-kafka-1
520f762e9e20   temporalio/ui:2.10.3                                       "./start-ui-server.sh"   39 minutes ago   Up 38 minutes             0.0.0.0:8081->8080/tcp, :::8081->8080/tcp                                                      posthog-temporal-ui-1
21cc4d453198   temporalio/auto-setup:1.20.0                               "/etc/temporal/entry…"   39 minutes ago   Up 38 minutes             6933-6935/tcp, 6939/tcp, 7234-7235/tcp, 7239/tcp, 0.0.0.0:7233->7233/tcp, :::7233->7233/tcp   posthog-temporal-1
203bdfd8a63b   provectuslabs/kafka-ui:latest                              "/bin/sh -c 'java --…"   39 minutes ago   Up 38 minutes             0.0.0.0:8088->8080/tcp, :::8088->8080/tcp                                                      posthog-kafka_ui-1
0e6745748ef4   redis:6.2.7-alpine                                         "docker-entrypoint.s…"   39 minutes ago   Up 38 minutes             6379/tcp                                                                                       posthog-redis-1
d069602094a0   zookeeper:3.7.0                                            "/docker-entrypoint.…"   39 minutes ago   Up 38 minutes             2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp                                                         posthog-zookeeper-1
b9975b1342c8   postgres:12-alpine                                         "docker-entrypoint.s…"   39 minutes ago   Up 38 minutes (healthy)   5432/tcp                                                                                       posthog-db-1
d32951b1166f   minio/minio:RELEASE.2024-05-10T01-41-38Z.fips              "sh -c 'mkdir -p /da…"   39 minutes ago   Up 38 minutes             9000/tcp, 0.0.0.0:19000-19001->19000-19001/tcp, :::19000-19001->19000-19001/tcp               posthog-objectstorage-1
5edfb241970b   mher/flower:2.0.0                                          "celery flower"          47 hours ago     Up 47 hours               0.0.0.0:5555->5555/tcp, :::5555->5555/tcp                                                      posthog-flower-1
57fb5fb4e064   elasticsearch:7.16.2                                       "/bin/tini -- /usr/l…"   4 days ago       Up 4 days                 9200/tcp, 9300/tcp                                                                             posthog-elasticsearch-1

feedanal · May 16 '24 05:05

> updated (rebuilt) the posthog container -> this (update) reset the offset to 0

Hmmm, we might be missing something here for sure. If we were to do Kafka maintenance such that the offset was reset, we'd manually reset the redis watermark... (we try to avoid duplicate ingestion, but playback is relatively good at deduplicating, so it might be more complex than that in reality, but it's certainly close enough for rock and roll...)

so there's something that needs fixing or documenting here.

> updated (rebuilt) the posthog container

can you clarify what this step is? why you'd do it?

(thanks again for these super clear, deep reports)

pauldambra · May 20 '24 19:05

We update docker-compose.yml with a new container image hash for the PostHog images:

  web:
    extends:
      file: docker-compose.base.yml
      service: web
    command: /compose/start
    volumes:
      - ./compose:/compose
    image: posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1 <---- here

and then run docker-compose -f "docker-compose.yml" up -d to pull the updated images and recreate the containers.

feedanal · May 21 '24 10:05

ooh and that resets the kafka offset 🤯

@frankh @fuziontech I'm well beyond my docker knowledge here... but there's a gap between how we operate session replay and what docker compose is doing for self-hosted...

is there a way to avoid the kafka offset reset here?

pauldambra · May 21 '24 10:05

Nice troubleshooting! Just wanted to add my 2 cents, since we were struggling to get a stable self-hosted deployment recently. I think we have everything working smoothly now, but we did apply quite a few changes. One of them was adding persistence to the redis container. I think that's also where the high water mark is being stored. Without mounting the /data directory, Redis is wiped on every container recreation.

Part of our docker-compose.yml (we also bind-mount to host filesystem instead of using docker volumes):

    redis:
        extends:
            file: docker-compose.base.yml
            service: redis
        restart: always
        volumes:
            - /srv/posthog/redis:/data

We've also added persistence for the kafka container, although I'm not sure if it's relevant in this case:

    kafka:
        extends:
            file: docker-compose.base.yml
            service: kafka
        restart: always
        depends_on:
            - zookeeper
        volumes:
            - /srv/posthog/kafka/data:/bitnami/kafka

jgoclawski · May 21 '24 13:05

Redis persistence is what's needed here, I think. Good catch @jgoclawski, will try this on the next update.

movy · May 21 '24 13:05

I guess both Kafka and Redis need to write to disk in order for the offsets and watermark values to survive rebuilds... will test and create a PR if it works.

feedanal · May 22 '24 12:05

Thanks for another fix @feedanal 😍

I'll close this (but folk can re-open if needed)

pauldambra · Jun 01 '24 08:06