receive: memory spike when running in RouteOnly mode

Open defreng opened this issue 2 years ago • 7 comments

Thanos, Prometheus and Golang version used: v0.32.2

Object Storage Provider: none

What happened:

We are running a Thanos Receive router deployment in front of our receive ingestors, which currently handles about 50 requests/second. During normal operation, the router pods handle this easily with 2 replicas, each using less than 70MB of memory.

However, when one of the ingestor pods in the StatefulSet behind the router is temporarily unavailable and then comes back online after 5 minutes or so, there is a spike in incoming remote write requests as the clients' retries start succeeding.

When that happens, the receive router pods' memory usage shoots up within 3-5 seconds from 70MB to over 2000MB (which is our limit and therefore results in an OOMKill).

What you expected to happen:

It's unclear to me why this is happening. What is all that memory used for in the receive router?

Is there a way to avoid this?

defreng avatar Oct 09 '23 19:10 defreng

What's your replication factor? In practice, it should be at least 3 to allow for downtime.

GiedriusS avatar Oct 10 '23 09:10 GiedriusS

@GiedriusS As this deployment only handles non-critical data, we run with a replication factor of 1 and accept data loss in case of issues.

defreng avatar Oct 10 '23 15:10 defreng

Can you share some of the configuration you're using to create these Receive instances?

epeters-jrmngndr avatar Oct 25 '23 21:10 epeters-jrmngndr

Sure! This is our configuration:

          args:
            - receive
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --remote-write.address=0.0.0.0:19291
            - --receive.replication-factor=1
            - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
            - --receive.hashrings-algorithm=ketama
            - --label=receive="true"

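For completeness, the hashrings file it points to is just the documented JSON list of hashrings with the ingestor gRPC endpoints; ours looks roughly like the sketch below (the hostnames are illustrative, not our real ones):

    [
      {
        "hashring": "default",
        "endpoints": [
          "thanos-receive-ingestor-0.thanos-receive-ingestor.monitoring.svc:10901",
          "thanos-receive-ingestor-1.thanos-receive-ingestor.monitoring.svc:10901",
          "thanos-receive-ingestor-2.thanos-receive-ingestor.monitoring.svc:10901"
        ]
      }
    ]
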
defreng avatar Oct 26 '23 15:10 defreng

Hi, unfortunately nothing can be done here. A replication factor of 1 doesn't allow any downtime, and if Prometheus cannot send metrics it retries, which increases the memory usage of Receive. There has been some movement to change how quorum works, but that's outside the scope of this ticket.
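
The size of that retry burst is shaped on the client side by Prometheus's remote_write queue settings, roughly along these lines (values here are only illustrative, not a recommendation):

    remote_write:
      - url: http://thanos-receive-router.monitoring.svc:19291/api/v1/receive  # router's remote-write endpoint (hostname illustrative)
        queue_config:
          max_shards: 50             # upper bound on parallel senders per remote write queue
          capacity: 2500             # samples buffered per shard
          max_samples_per_send: 500  # samples per remote write request
          min_backoff: 30ms          # retry backoff after a failed send
          max_backoff: 5s

Whatever the clients have queued up during the outage gets flushed as soon as the endpoint recovers, and that backlog is what the router has to absorb all at once.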

GiedriusS avatar Oct 26 '23 16:10 GiedriusS

@defreng Have you checked the thanos-receive-controller project? It should cover your problem with replication-factor=1, so the system avoids downtime for metric writes.

hayk96 avatar Oct 28 '23 18:10 hayk96

Hi

In the meantime we updated our configuration to replication factor 3 and also allocated more resources to the router (8GB of memory, of which only about 5% is used during normal operation).

However, when the ingestors do have some downtime (which unfortunately still happens from time to time), once they come back online the routers are all overwhelmed with requests and die with an OOMKill (hitting the 8GB limit within 2-3 seconds).

Do you have any suspicion about which mechanism is eating all the memory? Would it make sense to be able to limit the number of concurrent requests the router handles?
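
(From a quick look at the docs, the experimental Receive limits configuration seems to go in that direction; if I'm reading it right, something like the sketch below passed via --receive.limits-config-file could cap in-flight writes at the router. I haven't tried it and the field names are from memory, so please double-check against the docs.)

    write:
      global:
        # cap on concurrent remote write requests handled by this Receive instance
        max_concurrency: 30
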

defreng avatar Dec 22 '23 15:12 defreng