
Recurring silent crash

Open mmuller99 opened this issue 2 years ago • 18 comments

Describe the bug Our EventStore instance crashes silently every 30-40 days. There are no logs that indicate any error or warning; it just shuts down. We have a large single instance of close to 5TB running locally.

To Reproduce Steps to reproduce the behavior:

  1. No steps

Expected behavior EventStore should remain up.

Actual behavior EventStore crashes recurringly after 30-40 days of uptime.

Config/Logs/Screenshots Config:

EventStore details

  • EventStore server version: 21.10.7
  • Operating system: Ubuntu 20.04.2 with ES running in Docker
  • EventStore client version (if applicable):

Additional context With no logs indicating anything, we've been very hesitant to log this issue, but the hope is that other EventStore users might have experienced the same thing.

DB-333

mmuller99 avatar Aug 11 '22 12:08 mmuller99

Hi @mmuller99, how much memory does the instance have available? Is it possible that the instance is running out of memory and the process is being killed as a result?

hayley-jean avatar Aug 12 '22 08:08 hayley-jean

Hi @hayley-jean, our server has 64GB and only runs ES. We do keep 32GB of chunks in memory, but we've never seen the server's memory spike; we currently have 24GB of memory available. I initially thought index merges were the culprit, because we recently had a big merge into a 184GB index file, so the prior crashes could have been due to that merge failing. But just this Monday it crashed again silently, so that theory is out the window.

It has probably crashed about 6-8 times, and every time we do log inspections and server health checks, but nothing jumps out.

It could be Docker environment related, I don't know. As mentioned above, we're using the 21.10.7-buster-slim image.

mmuller99 avatar Aug 12 '22 14:08 mmuller99

This is the only error in the logs (apart from some slow queue msg ones), but the server just continues to run after the error:

[screenshot: error from the logs]

But these crashes were already occurring before the bloom filters for the index files were introduced.

mmuller99 avatar Aug 12 '22 14:08 mmuller99

@mmuller99 Thanks for the additional info

While the index merges are unlikely to be the cause, you may want to set the MaxAutoMergeIndexLevel setting to a lower value to prevent such large index merges from occurring. You can read more about the indexes in the docs here, and about the MaxAutoMergeIndexLevel setting here.
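
For reference, a minimal sketch of what this could look like in the node's YAML config (the value below is an illustrative assumption, not a recommendation):

```
# eventstore.conf (YAML) - illustrative sketch only
# Index levels above this value are not merged automatically,
# which keeps individual merges from growing this large.
MaxAutoMergeIndexLevel: 8
```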

To further diagnose the silent crashes, you could try running the EventStoreDB process through systemd within the Docker container. Running through systemd would provide you with extra logs about the process, which may shed some light on what's happening when it crashes.

hayley-jean avatar Aug 15 '22 13:08 hayley-jean

@mmuller99 in addition to what @hayley-jean has suggested, you could try setting the stream info cache size explicitly. The default of 0 causes it to be dynamically sized on startup according to available memory (you'll find the result of this calculation in the log), but the calculation doesn't take into account the size of the chunks cache, so it is likely sizing the cache too large for you (this is something we will be fixing).

The old default, before it was dynamically sized, was 100000. An ideal size is about 2x the number of streams your system needs quick access to for reading/writing. Each entry takes around 1 KB (depending on the stream names), so the old default takes around 100 MB.
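
As a rough sketch, setting it explicitly in the node's YAML config could look like this (the value is illustrative; size it to roughly 2x the number of streams you need quick access to):

```
# eventstore.conf (YAML) - illustrative sketch only
# ~1 KB per entry, so 100000 entries is roughly 100 MB of memory
StreamInfoCacheCapacity: 100000
```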

timothycoleman avatar Aug 16 '22 07:08 timothycoleman

Thanks @hayley-jean @timothycoleman.

@timothycoleman, the stream info cache size does indeed calculate too high. Will fine-tune it appropriately, thanks.

[screenshot: calculated stream info cache size from the startup log]

Will report back with results.

mmuller99 avatar Aug 16 '22 09:08 mmuller99

@mmuller99 I have seen one case of this on v21.10.2; we have since upgraded to 21.10.7 and still believe it to be an issue. We are running with a stream info cache size of 100000, which was the default on v5, where we did not see this issue.

The only thing we have seen in the logs is a steady increase in the "SLOW" messages being logged as time goes on, up until it restarted (over the course of weeks). Have you also seen this?

PaskeS avatar Aug 19 '22 14:08 PaskeS

Hi @PaskeS, thanks for reaching out. Looking at our logs leading up to the crash, the number of VERY SLOW QUEUE MSG messages does not seem to grow as time goes on. Unfortunately, our workflows sometimes do full stream reads, and some very old events are quite big (around 3 MB), which I believe is a possible culprit for those messages.

What is your DB age and size? Would you mind sharing your config, like above?

mmuller99 avatar Aug 19 '22 15:08 mmuller99

Thanks for the quick reply @mmuller99

This particular cluster is 1TB and 5 years old. In terms of the logs, what we saw was an uptick in SLOW QUEUE MSG and SLOW BUS MSG in the weeks leading up to it, and then about 5 seconds before it crashed we saw VERY SLOW QUEUE MSG. We are not handling any large events.

Config:

WORKER THREADS:                          5 (Yaml)
SKIP INDEX SCAN ON READS:                true (Yaml)
LOG:                                     REDACTED (Yaml)
STREAM INFO CACHE CAPACITY:              100000 (Yaml)
CLUSTER SIZE:                            5 (Yaml)
NODE PRIORITY:                           1 (Yaml)
DISCOVER VIA DNS:                        false (Yaml)
GOSSIP SEED:                             REDACTED0,REDACTED1,REDACTED2,REDACTED3,REDACTED4 (Yaml)
GOSSIP INTERVAL MS:                      1000 (Yaml)
GOSSIP TIMEOUT MS:                       2500 (Yaml)
DEAD MEMBER REMOVAL PERIOD SEC:          21600 (Yaml)
SCAVENGE HISTORY MAX AGE:                1825 (Yaml)
DB:                                      REDACTED (Yaml)
USE INDEX BLOOM FILTERS:                 false (Yaml)
INDEX CACHE SIZE:                        0 (Yaml)
SKIP DB VERIFY:                          true (Yaml)
PREPARE TIMEOUT MS:                      2000 (Yaml)
COMMIT TIMEOUT MS:                       2000 (Yaml)
SKIP INDEX VERIFY:                       true (Yaml)
ALWAYS KEEP SCAVENGED:                   true (Yaml)
REDUCE FILE CACHE PRESSURE:              true (Yaml)
INITIALIZATION THREADS:                  4 (Yaml)
READER THREADS COUNT:                    4 (Yaml)
STREAM EXISTENCE FILTER SIZE:            0 (Yaml)
INT IP:                                  REDACTED (Yaml)
EXT IP:                                  REDACTED (Yaml)
HTTP PORT:                               REDACTED (Yaml)
ENABLE EXTERNAL TCP:                     true (Yaml)
INT TCP PORT:                            REDACTED (Yaml)
EXT TCP PORT:                            REDACTED (Yaml)
EXT HOST ADVERTISE AS:                   REDACTED (Yaml)
INT HOST ADVERTISE AS:                   REDACTED (Yaml)
ADVERTISE HOST TO CLIENT AS:             REDACTED (Yaml)
INT TCP HEARTBEAT TIMEOUT:               2500 (Yaml)
EXT TCP HEARTBEAT TIMEOUT:               2500 (Yaml)
INT TCP HEARTBEAT INTERVAL:              1000 (Yaml)
EXT TCP HEARTBEAT INTERVAL:              2000 (Yaml)
ENABLE ATOM PUB OVER HTTP:               true (Yaml)
RUN PROJECTIONS:                         All (Yaml)
PROJECTION THREADS:                      3 (Yaml)
PROJECTION EXECUTION TIMEOUT:            2000 (Yaml)

PaskeS avatar Aug 19 '22 16:08 PaskeS

Thanks @PaskeS, nothing stands out to me. What environment is your cluster running in?

mmuller99 avatar Aug 19 '22 16:08 mmuller99

@mmuller99 We saw this in our live environment, which made it very noticeable. We have taken to restarting it before it gets to this point, to keep some control.

PaskeS avatar Aug 19 '22 16:08 PaskeS

Apologies, I meant the physical environment. We're running Ubuntu 20.04 with Docker, using the 21.10.7-buster-slim image.

mmuller99 avatar Aug 19 '22 16:08 mmuller99

This is on Windows, so if it is the same issue, that rules out the physical environment aspect.

PaskeS avatar Aug 19 '22 16:08 PaskeS

Thanks @PaskeS, it's good to be able to rule that out.

mmuller99 avatar Aug 19 '22 16:08 mmuller99

Sorry for going sideways, but why do you run ESDB in a container if you have a dedicated machine? The Docker environment has its own constraints, which don't necessarily align with the resources available for the physical machine.

alexeyzimarev avatar Oct 27 '22 10:10 alexeyzimarev

Hi @mmuller99, did the reduced stream info cache capacity setting help with this?

timothycoleman avatar Sep 14 '23 07:09 timothycoleman

Hi @timothycoleman, yes, it's still happening. It crashes without any error. We just restart and are back up within 45 minutes.

mmuller99 avatar Sep 14 '23 08:09 mmuller99

@mmuller99 Sorry if I missed it above, but are there any messages in the Docker log itself around the time the node exits? I wonder if the container runtime might be killing it.

timothycoleman avatar Sep 22 '23 06:09 timothycoleman