Azure cookbook uses a network share
Hello :wave:
It was reported that our Azure cookbook uses a network share, which is unsupported by our storage backend[^1] and possibly corrupts DBs.
Could we change the cookbook to either use a non-network type of disk, or to warn users of the corruption risk with this setup?
User report:
The Meilisearch website provides a one-click deployment ARM template in the documentation here:
https://www.meilisearch.com/docs/learn/cookbooks/azure
So you might also want to change the storage type there or warn users.
[^1]: Do not use LMDB databases on remote filesystems (source)
Thanks for reporting this, @dureuill!
This was originally authored by someone from Microsoft's team, if I understand correctly, and at the moment we have no experts familiar with Azure in the team. @tpayet, do you think you could get in touch with the original author and see if they'd be interested in updating this guide?
Hello, it appears more people are affected: https://github.com/meilisearch/meilisearch/issues/4274
If we're not able to update the guide, should we remove it and instead put a warning that Meilisearch has been unreliable on Azure, with a link to some of the issues including this one? This would prevent more users from encountering the issue by following our guide.
As for writing a new guide, I suppose we could ask the community for its input, with the added constraint that the disk should not be "exotic".
@cmaneu, if I remember correctly, you were the one writing that Azure cookbook 😁 Could you help us update it, as we do not have much expertise with Azure 😬 ?
Hello @tpayet and team, I would love to help you update the Azure cookbook to ensure it runs smoothly 😉.
To pick the best solution, I need to understand the disk requirements a bit better (I'm not familiar with LMDB internals). In Azure, most permanent storage is accessed over some kind of network. We have attached storage, but only on VMs (which need more maintenance than PaaS services).
We can work via this issue or set up a quick meeting, as you prefer.
Hello @cmaneu!
Thanks for lending us a hand, much appreciated!
Perhaps we can start by trying to clarify your questions here and see if we can give you all the information you need asynchronously?
@dureuill, would you be able to give a bit more detail on what exactly makes LMDB incompatible with network shares? Are there any workarounds you are aware of?
Hey y'all,
Basically, the only « official » information we have is this one:
Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
Which comes from here.
Howard Chu often answered questions about it on Twitter, and I don't think I've seen a single network disk working correctly with LMDB. But sadly, his account got banned, so I can't really search for tweets about Azure that would highlight the issue they have already encountered.
So, overall, I would say the only thing we know is that LMDB requires an actual disk and not a virtualized or network disk.
@cmaneu I'm really struggling to find information about disks on the Azure website (I even contacted the sales service, who didn't know what kind of disk each plan uses). Thus, I was wondering, can you 100% confirm that the basic B2/B3 plans come with a network disk and that none of the plans listed here works? :pleading_face:
Taking a step further, I'm not Howard, but my understanding is that correctly implementing memory mapping and file locking on network filesystems is very hard, if it is possible at all. This is because these APIs are not designed with the network in mind.
I don't think there's a workaround besides "don't use network disks"
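To make "don't use network disks" checkable, here is a minimal Python sketch that looks up the filesystem type behind a database path on Linux. It assumes the `/proc/mounts` layout, and the `NETWORK_FS_TYPES` list is an illustrative, non-exhaustive assumption. It can only detect file shares (NFS, SMB/CIFS and similar); it cannot tell whether a block device that reports `ext4` or `xfs` is itself network-backed.

```python
import os

# Filesystem types that usually indicate a network share.
# Assumption for illustration only: this list is not exhaustive.
NETWORK_FS_TYPES = {
    "nfs", "nfs4", "cifs", "smb3", "smbfs", "9p",
    "fuse.sshfs", "ceph", "glusterfs",
}


def filesystem_type(path, mounts_text):
    """Return the fstype of the mount point that contains `path`.

    `mounts_text` follows the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties, later entries
            # win because later mounts shadow earlier ones.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_path(path, mounts_text):
    """True if `path` lives on one of the listed network filesystems."""
    return filesystem_type(path, mounts_text) in NETWORK_FS_TYPES


if __name__ == "__main__":
    # "/meili_data" is a hypothetical database directory.
    with open("/proc/mounts") as mounts:
        print(is_network_path("/meili_data", mounts.read()))
```

A check like this could run before starting Meilisearch so that a share mounted over SMB or NFS is refused early instead of being discovered through corruption.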
From your description, it seems this is sort of an intrinsic issue with network filesystems. If that's the case, do other search engines have the same problem? If not, how do they work around the issue?
It's an intrinsic issue with network filesystems when you rely on the filesystem too much. Or when you use LMDB in our case. I don't think any other search engine relies on LMDB :thinking: As far as I know:
- Typesense uses RocksDB for storage, but most data actually lives in RAM
- Elasticsearch uses Lucene, which works with network disks
- Manticore Search has a fully custom storage layer
Hello all, @irevoire, I can confirm that all App Service plans use shared network storage. In fact, even most Azure VMs rely on network storage as well (called "managed disks"). For most applications, these provide many benefits: redundancy, scalability, backups, the ability to stop or resize the VM while keeping the disk, etc.
There is one way to ensure you are using an actual physical disk, directly attached to the VM host: ephemeral disks. There are multiple ways to access them:
- On VMs and VMSSs (also called temp disk): https://learn.microsoft.com/en-us/azure/virtual-machines/ephemeral-os-disks and https://learn.microsoft.com/en-us/archive/blogs/mast/understanding-the-temporary-drive-on-windows-azure-virtual-machines
- On Azure Kubernetes Service: https://learn.microsoft.com/en-us/azure/aks/concepts-storage#ephemeral-os-disk
The name of these disks makes their purpose clear: they are ephemeral, which means the data is deleted once the VM is shut down.
Echoing what has been said in another issue (maybe by @dureuill), Meilisearch should not be used as the primary storage for your data. Having said that, having to re-index all your data each time you need to scale up or a VM is discarded for whatever reason is not really efficient.
Without knowing your specific requirements, my first proposal would be to leverage Meilisearch snapshots: provision each VM with an existing snapshot and restore the DB onto the temp disk. The applicability of this solution may vary depending on your indexing strategy and scale requirements. If you need 50 Meilisearch hosts and index new data in real time, that may be quite complex to implement. If you index once or twice a day, that may be manageable.
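As a rough illustration of the snapshot idea, the Python sketch below assembles a Meilisearch launch command that keeps the database on the local temp disk and restores it from a snapshot stored elsewhere. All paths are hypothetical, the one-hour interval is an arbitrary assumption, and the flags should be checked against the configuration reference for the Meilisearch version in use, since older releases configured the snapshot interval differently.

```python
def meilisearch_command(db_path, snapshot_file, snapshot_dir, interval_seconds=3600):
    """Build an argv list for a Meilisearch instance on an ephemeral disk.

    db_path       -- database directory on the local (ephemeral) disk
    snapshot_file -- snapshot to restore from when the database is missing
    snapshot_dir  -- where newly scheduled snapshots are written
    """
    return [
        "meilisearch",
        "--db-path", db_path,
        # Restore from the snapshot on a fresh VM...
        "--import-snapshot", snapshot_file,
        # ...but still start if no snapshot has been produced yet,
        "--ignore-missing-snapshot",
        # and keep the existing database after a simple process restart.
        "--ignore-snapshot-if-db-exists",
        # Write a new snapshot every `interval_seconds`.
        f"--schedule-snapshot={interval_seconds}",
        "--snapshot-dir", snapshot_dir,
    ]


if __name__ == "__main__":
    # Hypothetical layout: temp disk under /mnt, persistent share under /persist.
    print(" ".join(meilisearch_command(
        db_path="/mnt/meili/data.ms",
        snapshot_file="/persist/snapshots/data.ms.snapshot",
        snapshot_dir="/persist/snapshots",
    )))
```

With this layout only the snapshot archive, a plain file, lives on persistent storage, while the LMDB environment itself stays on the directly attached disk. Anything indexed since the last snapshot is lost when the VM is discarded.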
Thanks a lot, @cmaneu; that's very valuable feedback. I didn't know Azure worked like that.
@guimachiavelli With this new information, what should we do here? Maybe we should:
- as first steps:
  - remove (or hide) the Azure article
  - add somewhere that Meilisearch is NOT compatible with Azure
- as a second step, if @cmaneu is available to help us (thank you again for your involvement in this 🙏), document the workaround to use Meilisearch on Azure VMs
I mostly agree with you, @curquiza, but I'm actually unsure about documenting the snapshot workaround. As far as I understand it, it will result in an overall subpar experience, and possibly lead to trouble down the line. Wouldn't it be better to stop at "Azure and Meilisearch are not compatible" instead of risking users being disappointed with Azure and/or Meilisearch?
What's your technical opinion, @dureuill and @irevoire?
If the guide is well written, I guess it could work with a big warning at the top stating that it’ll be hard because most Azure plans are not compatible with Meilisearch 🤔 Overall, I’m not against having a guide if it makes it really clear that you’re going down a road full of ambushes.
I wouldn't bother writing a guide at this point. I'd document that Azure is unsupported, say that a workaround exists if you really need it, and point to this issue, especially the relevant message from @cmaneu.
I'm sad to see the option go, but we really can't do anything if our DB gets corrupted under our feet, and documenting the incompatibility is better than letting people think it will work.
We'll revisit the issue if we someday add an alternative backend that is more lenient with network shares.
I'm a bit confused here. The EBS disks used by EC2 instances are also network storage. This is how every cloud provider works.
I was about to say the exact same thing as @knd775. All hyperscalers work the same way (other "cloud providers" usually operate on a much lower scale and sell resources as "cloud" while technically it's just VMs in a single box).
I've answered the specific question about storage. Yet I'm wondering about several things:
- How many people are actually deploying Meilisearch on Azure: Do you have analytics on your side to get a ballpark idea?
- How many people are deploying Meilisearch on other hyperscalers (AWS, Google, ...)?
- Is there a way to reproduce the issue, or are there any "specifics" that trigger the issue (DB size, ...)? Is there a specific database size where the issue starts to arise? For example, in an App Service, we can force a restart of the container, which would cause the process to (gracefully) shut down.
- If so, we can test other services. For example, Azure Container Instances, which runs on Kubernetes behind the scenes, could have different storage performance that may solve the issue.
How many people are actually deploying Meilisearch on Azure: Do you have analytics on your side to get a ballpark idea?
We have a few super anonymized analytics, and that’s not part of it, sadly. I have no idea. But from the issues and discussions on Discord I’ve seen, I would say probably no one; the very few people we’ve seen using Azure had to change to another provider because it was causing DB corruptions. And I would say I’ve seen fewer than 10 people trying to use Azure overall 🤔
How many people are deploying Meilisearch on other hyperscalers (AWS, Google, etc.)?
So, once again, I don’t have any number for the different providers; the only info I know about open source users is that we have:
- Around 400k new installations every month (A lot of them come from CI, I would say, probably github runners)
- More than 25k active installations (they’ve been and are still using the instance for some time)
I would say that most of them are running on a cloud provider.
Is there a way to reproduce the issue, or are there any "specifics" that trigger the issue (db size, ...) ? Is there a specific database size where the issue starts to arise?
Honestly, we don’t really know. The latest person we’ve seen having an issue with Azure reported that his instance was working well at first. But after pushing it to prod, he started getting strange behavior, like tasks being processed and finishing but still being marked as enqueued in the task queue. I did not try to reproduce the issues much, but I guess problems start to arise when you have some traffic and probably simultaneous read and write transactions.
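One way to watch for the "finished but still enqueued" symptom is to flag tasks that have stayed in the `enqueued` state for an unusually long time. The Python sketch below assumes a list of task objects shaped like the ones returned by `GET /tasks?statuses=enqueued` (fields `uid`, `status`, `enqueuedAt`) with UTC timestamps ending in `Z`. The 30-minute threshold is an arbitrary assumption, and a busy queue can legitimately hold older tasks, so this is only a heuristic.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp such as 2024-01-01T10:00:00.123456789Z."""
    body = value.rstrip("Z")
    if "." in body:
        seconds, fraction = body.split(".", 1)
        # datetime stores microseconds at most, so trim extra digits.
        body = f"{seconds}.{fraction[:6].ljust(6, '0')}"
        fmt = "%Y-%m-%dT%H:%M:%S.%f"
    else:
        fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(body, fmt).replace(tzinfo=timezone.utc)


def stale_enqueued(tasks, now, max_age=timedelta(minutes=30)):
    """Return the uids of tasks that have been enqueued longer than `max_age`."""
    return [
        task["uid"]
        for task in tasks
        if task["status"] == "enqueued"
        and now - parse_timestamp(task["enqueuedAt"]) > max_age
    ]


if __name__ == "__main__":
    # Hand-written sample data, not real Meilisearch output.
    sample = [
        {"uid": 7, "status": "enqueued", "enqueuedAt": "2024-01-01T10:00:00.123456789Z"},
        {"uid": 8, "status": "succeeded", "enqueuedAt": "2024-01-01T10:05:00Z"},
    ]
    print(stale_enqueued(sample, now=datetime.now(timezone.utc)))
```

Logging the flagged uids alongside the instance logs would give maintainers something concrete to compare when a report like the one above comes in.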
The EBS disks used by EC2 instances are also network storage. This is how every cloud provider works.
Is it possible that other providers have a full working implementation while Azure doesn’t? Because in the end, I guess it all comes down to this sentence on the LMDB documentation:
Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.
If these operations were working well on Azure, there would be no issue, I guess 🤔
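Whether `flock()` behaves as expected on a given mount can be probed directly. The Python sketch below, a Unix-only illustration rather than anything LMDB or Meilisearch ships, takes an exclusive lock through one file handle and checks that a second, independently opened handle is refused. The probe file name is made up. A passing result only covers advisory locking within one process on one host; it says nothing about memory map sync or about locking across hosts, so it cannot prove a filesystem is safe for LMDB.

```python
import fcntl
import os


def flock_is_exclusive(directory):
    """Return True if a second open file description is refused the lock."""
    path = os.path.join(directory, ".flock_probe")  # hypothetical probe file
    first = open(path, "w")
    second = open(path, "w")
    try:
        # Take the exclusive lock through the first handle.
        fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # A working flock() must refuse this second exclusive lock.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # Both handles now "hold" an exclusive lock: locking is unreliable here.
        return False
    finally:
        first.close()
        second.close()
        os.remove(path)


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"flock exclusive on {target}: {flock_is_exclusive(target)}")
```

Running it against the directory that would hold `data.ms` gives a quick negative signal: a `False` result or an `OSError` suggests the mount should not host the database.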
Also, FYI, I think most people are using something called block storage on AWS. I have no idea what that really means in the end, but from what I understand, it works. It also works on DigitalOcean; I don’t know about other providers.
I can provide my own experience with meilisearch on azure.
My data is very small (< 100 MB) and there doesn't seem to be anything specific that triggers it. I'll just wake up one day and some (or all) of my indexes will be gone or corrupted. I will say, there was one time that my indexes got corrupted in two different environments at the exact same time. They were in two different app service pools in two different subscriptions, but within the same region and AZ.
Oh wow, that seems awful... And is there some kind of frequency to it? Like does it happen once a week or once a month? On a specific day when there is a low usage or during rush hours? 🤔
I had the task queue get corrupted twice in the span of ~6 months, and then I've had issues with indexes disappearing or becoming empty a couple of times in the last month. We're only using it in dev/test environments currently, so there's never much traffic. Unfortunately we weren't capturing logs until after the last occurrence, so I can't provide much more info until it happens again.
edit: I just remembered something else that might be relevant. Until this week our index configuration (like setting filterable, sortable, and searchable attributes) was done on the startup of our frontend. During local development, this could be quite frequent due to hot reload. We've since moved this to our backend where it now runs before a full re-index. It's possible these frequent config changes played a role in the more recent occurrences (which only affected a single environment)
- Is there a way to reproduce the issue, or are there any "specifics" that trigger the issue (DB size, ...)? Is there a specific database size where the issue starts to arise? For example, in an App Service, we can force a restart of the container, which would cause the process to (gracefully) shut down.
We have deployed meilisearch:1.6 as an azure app service with its own local storage (it is a network storage though, but not a dedicated file share like in the examples). To reproduce:
- deploy as app service
- fill index with data (just a few documents)
- scale up app service e.g. basic 2 to basic 3 https://github.com/meilisearch/meilisearch/issues/4123
- now meilisearch crashes and cannot restart
Assertion 'rc == 0' failed in mdb_page_dirty()
Just opened https://github.com/meilisearch/meilisearch/issues/4402 to report the issue on the engine side
More news here
I recently moved my application to production and had the same issues that @knd775 describes. At first I thought that configuring the index on startup, as described, was the issue, but now it happens over and over again. I also attached a file storage. Is there anything I could do to fix this?
I'm currently running on B1 to save costs but could move to higher tiers if needed.
Edit: I used the wrong mapping path for the file share. I will give an update if the index fails again without any upgrading.
I thought we were in the clear, but it just happened again after months of stability. There was some Azure App Service maintenance in our region that caused a momentary drop in connectivity. Now, our keys don't work anymore. Surprisingly the indexes and task queue are fine this time. Everything works unless we try to call the keys endpoint
Hello @knd775—thanks for reminding me this issue is still open and that I need to update the docs to reflect our guidance for Azure and Meilisearch.
Since this is the docs repo, our focus is more on describing current behaviour. For a more helpful discussion on the subject of network drives, Azure, and Meilisearch, I'd recommend commenting on this issue in the engine repo.
If you need immediate support, I recommend contacting the team in the official Meilisearch Discord.
Since there's been no sign of life from Azure on this issue for some time, @gmourier and I are thinking about removing the Azure guide from the docs altogether, instead of endorsing something that may blow up in our users' faces through no fault of their own.
Do you have any thoughts on this, @curquiza?
I share the same point of view. Let's remove the guide if it's broken, since it can confuse more than help.
I've had this problem for a while now. Once a week or so I must recreate my indices completely. :( Running on azure app service