AKS
SQL Server detects physical RAM rather than pod limit in v1.25.5
**Describe the bug**
We are running the same workloads on multiple clusters, most of them on 1.24.6 and the newest one on 1.25.5 (it was 1.25.4 earlier today, with the same issue).
The SQL Server (mcr.microsoft.com/mssql/server:2019-latest) reports the following message on startup:
2023-02-06 22:00:36.56 Server Detected **51449** MB of RAM. This is an informational message; no user action is required.
and later fails with an OOM error, even though the pod memory limit is set to "4Gi".
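A quick way to see the mismatch is to compare the limit the kernel exposes inside the container with what SQL Server logs (the pod name `mssql-0` is illustrative; the errorlog path is the image default):

```sh
# cgroup v2 memory limit as seen inside the container; prints 4294967296 for a 4Gi limit.
kubectl exec mssql-0 -- cat /sys/fs/cgroup/memory.max

# The amount of RAM SQL Server itself detected at startup.
kubectl exec mssql-0 -- grep "Detected" /var/opt/mssql/log/errorlog
```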
**To Reproduce**
- Deploy SQL Server to a v1.25.5 cluster (a minimal pod spec sketch follows below).
- Run a memory-heavy workload, e.g. installing a DACPAC on the SQL instance.
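For reference, a minimal repro sketch, with placeholder names and SA password, that gives the pod a 4Gi memory limit:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mssql-repro            # illustrative name
spec:
  containers:
  - name: mssql
    image: mcr.microsoft.com/mssql/server:2019-latest
    env:
    - name: ACCEPT_EULA
      value: "Y"
    - name: MSSQL_SA_PASSWORD
      value: "Change_Me123!"   # placeholder only
    resources:
      limits:
        memory: 4Gi            # the limit SQL Server should respect
EOF
```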
**Expected behavior**
SQL Server detects the proper amount of RAM (the pod limit, not the node's physical memory).
**Environment:**
- Kubernetes version: 1.25.5
Action required from @Azure/aks-pm
I've pinged the SQL team: https://github.com/microsoft/mssql-docker/issues/814
Thanks. It works fine in AKS clusters on 1.24.6, so it could be an issue in AKS rather than in SQL Server.
This seems to be related: https://github.com/Azure/AKS/issues/3443#issuecomment-1471746964
This might help some of you: Kubernetes 1.25 included an update to use the cgroup v2 API (cgroups are basically how Kubernetes passes resource settings to containers)...
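If you want to confirm which cgroup version a node (or your container) is actually using, the standard check from the Kubernetes documentation works in any shell on the node or inside the pod:

```sh
# "cgroup2fs" means cgroup v2; "tmpfs" means cgroup v1.
stat -fc %T /sys/fs/cgroup/
```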
I too am having this issue on microk8s 1.25 and 1.26. Any resolutions?
Issue needing attention of @Azure/aks-leads
Any updates on this? We are running into the same problem after upgrading our AKS cluster to version 1.25.6
This issue has to be solved by the SQL Server team. SQL Server does not take the cgroup v1 vs. cgroup v2 change into account: on cgroup v1, SQL Server works fine; with cgroup v2, it fails to limit itself to the resources available.
Ubuntu 20.04 uses cgroup v1 by default, while Ubuntu 22.04 uses cgroup v2, and cgroup v2 is a breaking change. (Among other things, cgroups configure the maximum memory a process is allowed to use.)
After upgrading AKS to 1.25+, the AKS nodes use cgroup v2 (https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-components-breaking-changes-by-version).
The SQL Server image is currently (June 2023) based on Ubuntu 20.04 and does not support cgroup v2. So running a SQL Server container on AKS 1.25+ causes problems that can only be solved by Microsoft upgrading the SQL Server image to Ubuntu 22.04+.
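You can verify the image's base OS yourself; this assumes a local container runtime, but `kubectl exec <pod> -- cat /etc/os-release` against a running pod shows the same:

```sh
# Print the base OS of the SQL Server 2019 image; as of mid-2023 this reports Ubuntu 20.04.
docker run --rm --entrypoint cat mcr.microsoft.com/mssql/server:2019-latest /etc/os-release
```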
I've created a temporary workaround to prevent OOMKilled errors on AKS 1.25 by adding a /var/opt/mssql/mssql.conf file containing the maximum memory SQL Server is allowed to use. Example:
```ini
[memory]
memorylimitmb = 8096
```
We generate this file at container startup from the cgroup v2 limit:

```sh
# Read the pod memory limit from the cgroup v2 interface and cap SQL Server at ~80% of it.
_cgroupmax=$(cat /sys/fs/cgroup/memory.max)
_max_memory_in_mb=$(((_cgroupmax / 1024 / 1024) / 10 * 8))
echo "[memory]" > /var/opt/mssql/mssql.conf
echo "memorylimitmb = $_max_memory_in_mb" >> /var/opt/mssql/mssql.conf
```
With cgroup v2, the file /sys/fs/cgroup/memory.max contains the maximum memory the process may use. (The math: SQL Server may only use 80% of the available memory, which is the same ratio Microsoft uses on cgroup v1.)
How you add this conf file depends on your situation. We build our own SQL Server image based on the official mssql containers and run these commands during startup of the container, before SQL Server starts (see the sketch below).
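As a rough sketch, assuming the image's default server binary is at /opt/mssql/bin/sqlservr (the script name and the "max" fallback are illustrative choices, not part of the original workaround), such a startup wrapper could look like this:

```sh
#!/bin/bash
# entrypoint.sh (illustrative): derive mssql.conf from the cgroup v2 limit, then start SQL Server.
set -e

limit=$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo max)
if [ "$limit" != "max" ]; then
  # Cap SQL Server at ~80% of the pod memory limit, mirroring the calculation above.
  echo "[memory]" > /var/opt/mssql/mssql.conf
  echo "memorylimitmb = $(( limit / 1024 / 1024 / 10 * 8 ))" >> /var/opt/mssql/mssql.conf
fi

# Hand off to SQL Server as PID 1.
exec /opt/mssql/bin/sqlservr
```

In the Dockerfile you would copy this script and point the container's ENTRYPOINT/CMD at it instead of sqlservr directly.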
@Azure/aks-leads Any updates?