
MDB_PANIC: Update of meta page failed or environment had fatal error

Open Nextra opened this issue 1 year ago • 10 comments

Note: I know this is going to be a very unsatisfying bug report, but the only mention I can find of this error is #3936, which did not lead to any conclusion.

Describe the bug: After running for some time, and seemingly at random, Meilisearch starts returning the error MDB_PANIC: Update of meta page failed or environment had fatal error in response to pretty much any request made to it. Contrary to the discussion linked above, this is not specific to the task system, although it does become impossible to send documents to the instance.

At least ingestion and searching break, effectively causing a total outage. When/if the error happens again I will hopefully be able to provide a more detailed breakdown of which functionality is affected. Currently I'm assuming the instance becomes entirely unresponsive. According to monitoring the health route also errors, although I currently do not know which error it reports specifically.

To Reproduce: Unclear.

At this moment I am fairly certain that it has nothing to do with a specific document. I was reminded of a similar looking error #2338 and tried some of the same troubleshooting I did back then. But for this issue, as mentioned in the discussion above, often a simple restart is enough to get Meilisearch up and running again. I have tried resetting and re-indexing the data set entirely, and have also downgraded to a prior version, but the bug always takes a few days/weeks to reappear. To make this clear: I have encountered this issue starting from a fresh instance multiple times this month.

Meilisearch version: At least 1.4.x and 1.5.x are affected. I had 1.3.x deployed for a long time before upgrading and I think I did not see this error back then, but my memory might fail me here.

Nextra avatar Dec 20 '23 15:12 Nextra

Hello @Nextra 👋

Thank you very much for this report. I want to ask for more information on the type of disk you are using. The issue is related to LMDB, our internal key-value store, and not specifically to Meilisearch. It is highly recommended to look into the issues related to LMDB on the internet.

Also, just in case it's useful for you, we do also have a cloud offering ⚡️ where we manage your Meilisearch for you. There's a 14 day free trial period if you wanted to try it out.

Kerollmops avatar Dec 20 '23 17:12 Kerollmops

The instance is deployed to Azure ASP Premium V3 (P1V3), as I believe is recommended in your documentation. This has been running for well over a year, but the errors only started to appear recently, I believe after upgrading to 1.5.x from 1.3.x.

I would really like to assist in diagnosis, but I don't know where to start on this.

Unfortunately I need to deploy to private Azure at the moment, so the cloud offering is not useful.

Nextra avatar Dec 20 '23 17:12 Nextra

After a restart, the container now crashes with the following information:

2023-12-20T17:36:22.716603783Z [2023-12-20T17:36:22Z INFO  index_scheduler::batch] document addition done: DocumentAdditionResult { indexed_documents: 1, number_of_documents: 37981 }
2023-12-20T17:36:23.198525014Z /usr/local/cargo/git/checkouts/lmdb-rs-97af4d460cf53f67/501aa34/lmdb-sys/lmdb/libraries/liblmdb/mdb.:2443: Assertion 'rc == 0' failed in mdb_page_dirty()

Nextra avatar Dec 20 '23 19:12 Nextra

Hello Nextra 👋,

Several people running Azure reported being affected by what looks like DB corruptions: https://github.com/meilisearch/meilisearch/issues/4123

Apparently it is due to the kind of storage used, which is unfortunately the one currently recommended in our docs.

We have an issue open on our documentation repo to change the Azure guide so that we stop unintentionally misleading people, but we are not very comfortable with Azure ourselves (this part of the guide was apparently authored by someone at Microsoft): https://github.com/meilisearch/documentation/issues/2622.

dureuill avatar Dec 21 '23 08:12 dureuill

Interesting.

I have not used the template directly, so there is no separate network share mapped into the App Service. But maybe the persistent storage in /home that is included with the service plan still uses a network share?

The only thing I definitely know is that this is a recent development, as it used to work for months without fault.

Nextra avatar Dec 21 '23 09:12 Nextra

Can confirm the Meilisearch version has nothing to do with it. The errors started appearing less than 24 hours after downgrading to a fresh install of 1.3.5 just yesterday. Maybe something in Azure has changed to make this setup not work anymore.

Why do restarts fix the problem temporarily when the data store is supposedly corrupted?

Nextra avatar Dec 21 '23 11:12 Nextra

Quickly looking at the code of our database backend (LMDB), the MDB_PANIC can happen if:

  1. The call to msync to flush data from the mmap to the backing file failed: https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c#L4345, or
  2. The writer thread died while holding the write mutex of the environment: https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c#L11306

Given the filesystem on Azure seems unstable, and this looks independent of the Meilisearch version, my call is that it is (1) happening randomly.
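To illustrate case (1), here is a minimal Python sketch of the same pattern: write through a memory map, then flush it to the backing file. On Unix, `mmap.flush()` calls msync and raises `OSError` when the flush fails. This is not LMDB's code; the `write_and_flush` helper and the demo file are made up for illustration, and on a healthy local disk the flush is expected to succeed.

```python
import mmap
import os
import tempfile


def write_and_flush(path: str, payload: bytes) -> bool:
    """Write payload through a memory map and flush it to the backing file.

    Returns True if the flush succeeded, False if the OS reported an error.
    """
    with open(path, "r+b") as f:
        # Length 0 maps the whole file, so the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[: len(payload)] = payload
            try:
                mm.flush()  # msync() on Unix; raises OSError on failure
            except OSError as err:
                print(f"flush failed: {err}")
                return False
    return True


# Hypothetical demo file on local disk (not a real data.ms).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"\0" * 4096)

ok = write_and_flush(demo_path, b"hello")
```

If the backing file lives on storage that intermittently rejects the flush, the `except` branch is where the failure would surface.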

Can you tell me on which filesystem/disk the data.ms resides? Also, what is the OS? Is it a standard Linux, or an Azure specific distribution? Sorry if my questions are a bit basic, I'm not an Azure user.

Why do restarts fix the problem temporarily when the data store is supposedly corrupted?

The issue seems to be that the filesystem offered by Azure is (became 🤔) unreliable for some of the more "exotic" features that our DB backend is using (mostly, the mmap implementation). From there, symptoms could range from failed file flushes (what you're seeing) to full DB corruptions (as reported in other issues).

Once the issue happens in one of your indexes, you need to restart because LMDB remembers that a sync failed and flags the index as unusable for the current process.
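As a rough model of why a restart helps, the sketch below uses a toy environment with an in-memory "fatal" flag. It is a simplification, not LMDB's implementation: `ToyEnv`, `EnvPanic` and `flaky_sync` are hypothetical names, and the failing sync is simulated.

```python
class EnvPanic(Exception):
    """Stand-in for an MDB_PANIC-style error (hypothetical name)."""


class ToyEnv:
    """Toy environment: one failed sync poisons it until it is re-created."""

    def __init__(self, sync):
        self._sync = sync    # callable that flushes data; may raise OSError
        self._fatal = False  # sticky flag, lives only as long as this object

    def commit(self, data):
        if self._fatal:
            # Every later operation is refused, even if the disk has recovered.
            raise EnvPanic("environment had fatal error")
        try:
            self._sync(data)
        except OSError as err:
            self._fatal = True
            raise EnvPanic("update of meta page failed") from err


calls = {"n": 0}


def flaky_sync(data):
    """Simulated flush that fails exactly once, on its second call."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise OSError("simulated msync failure")


env = ToyEnv(flaky_sync)
env.commit(b"a")  # first commit succeeds

results = []
for payload in (b"b", b"c"):
    try:
        env.commit(payload)
        results.append("ok")
    except EnvPanic:
        results.append("panic")

# "Restart": a fresh environment starts with a clear flag.
env = ToyEnv(flaky_sync)
env.commit(b"d")
results.append("ok after restart")
```

The second and third commits both fail, although only the second one actually hit the simulated disk error. The flag lives in process memory, so re-creating the environment clears it without touching the data on disk.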

dureuill avatar Dec 21 '23 11:12 dureuill

This setup has worked pretty much uninterrupted for this whole year; the issues only started arising when I upgraded to 1.5 when it got released. I now know this was purely a coincidence, but it will give you an idea of the time frame.

Also, what is the OS? Is it a standard Linux, or an Azure specific distribution?

$ uname -a
Linux 5babb03ed2fd 5.15.131.1-2.cm2 #1 SMP Sun Sep 24 03:38:45 UTC 2023 x86_64 GNU/Linux
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"

Can you tell me on which filesystem/disk the data.ms resides?

I have configured it such that the database is located on the persistent storage that is included with the ASP. It seems like this does not make a difference compared to your deployment script (where a separate Azure Storage disk is mounted into the ASP app), because that included storage is still a network disk:

Filesystem                                                                             1K-blocks     Used  Available Use% Mounted on
//10.0.130.54/volume-25-default/onhws9917vtp22509fxp/w6ri90s76b5u63hab218aa101258fw9h 1048574944 16239524 1032335420   2% /home
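For anyone who wants to check this on their own deployment, here is a small Python sketch that looks up the filesystem type of the mount containing a given path, using text in the format of /proc/mounts. The `//host/share` device form in the output above is typical of an SMB/CIFS share, but the thread does not show the actual filesystem type, so the sample mount table below is hypothetical. The `NETWORK_FS` set is an assumption and not exhaustive.

```python
import posixpath

# Assumption: a non-exhaustive set of filesystem types that are network-backed.
NETWORK_FS = {"cifs", "smb3", "nfs", "nfs4", "9p", "fuse.sshfs"}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    Escaped characters in mount points (e.g. \\040 for a space) are not decoded.
    """
    best = ("", "")
    path = posixpath.normpath(path)
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


# Hypothetical mount table, loosely modeled on the df output above.
sample_mounts = """\
overlay / overlay rw,relatime 0 0
//10.0.0.1/share /home cifs rw,relatime 0 0
"""

home_mount = fs_type_for("/home/data.ms", sample_mounts)
tmp_mount = fs_type_for("/tmp/data.ms", sample_mounts)
on_network_share = home_mount[1] in NETWORK_FS
```

On a real Linux host, the mount table can be read with `open("/proc/mounts").read()` and passed in as `mounts_text`.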

Nextra avatar Dec 21 '23 12:12 Nextra

This setup has worked pretty much uninterrupted for this whole year, the issues only started arising when I upgraded to 1.5 when it got released

Did you perhaps restart your machine or otherwise perform a system update at the same time you updated the Meilisearch version?

dureuill avatar Dec 21 '23 14:12 dureuill

Not intentionally, but I assume that the VM behind ASP just regularly updates itself in the background. The switch to the new container could have been a possible trigger for such a process.

Nextra avatar Dec 21 '23 15:12 Nextra

Hello 👋

As we're not going to address supporting the discussed Azure setup for now, I'm closing this. Thank you for your report, feel free to reopen if you deem it necessary 😊

dureuill avatar Jan 22 '24 08:01 dureuill

I understand.

Funnily enough, since upgrading to 1.6.0-rc.3 there have been no new occurrences of this issue. The upgrades to rc.8 and then to the full release went smoothly, each running for weeks. When I made this ticket the service didn't even survive a full work day.

Quite mysterious.

Nextra avatar Jan 23 '24 07:01 Nextra

Thanks for the update. I can't think of anything in RC3 that would make things better. That's the good kind of mysterious though I guess 😅

dureuill avatar Jan 23 '24 08:01 dureuill