Stuck at "Initializing Nextcloud..." when attached to NFS PVC
Doing my best to duplicate helm/charts#22920 over to the new repo, as I am experiencing this issue as well. I have refined the details a bit, since this issue appears to be specific to NFS-based storage.
Describe the bug
When bringing up the nextcloud pod via the helm chart, the logs show the pod as being stuck at:
2020-08-31T19:00:42.054297154Z Configuring Redis as session handler
2020-08-31T19:00:42.098305129Z Initializing nextcloud 19.0.1.1 ...
Even backing the liveness/readiness probes out to over 5 minutes does not give it enough time to finish. If I instead switch the PVC to my storageClass for Rancher Longhorn (iSCSI), for example, the Nextcloud install initializes in seconds.
Version of Helm and Kubernetes:
helm: v3.3.0
kubernetes: v1.18.6
Which chart:
nextcloud/helm
What happened:
- Namespace is created.
- Helm creates NFS PVC, or it is created manually
- Helm instantiates Nextcloud pod
- Nextcloud pod attaches PVC, and starts
- Nextcloud container is stuck at the above line
What you expected to happen:
Nextcloud finishes initialization, and the Nextcloud files appear with the correct permissions on the NFS volume.
How to reproduce it (as minimally and precisely as possible):
Set up an NFS provisioner:
helm install nfs stable/nfs-client-provisioner \
  --set nfs.server=x.x.x.x --set nfs.path=<path>
OR Configure an NFS PV and PVC manually
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-data
  labels:
    app: cloud
    type: data
spec:
  capacity:
    storage: 100Ti
  nfs:
    path: <path>
    server: <server>
  mountOptions:
    - async
    - nfsvers=4.2
    - noatime
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-manual
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nextcloud-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: nfs-manual
  volumeMode: Filesystem
  selector:
    matchLabels:
      app: cloud
      type: data
Install Nextcloud: helm install nextcloud nextcloud/helm -f values.yaml --namespace=nextcloud
values.yaml:
image:
  repository: nextcloud
  tag: 19
readinessProbe:
  initialDelaySeconds: 560
livenessProbe:
  initialDelaySeconds: 560
resources:
  requests:
    cpu: 200m
    memory: 500Mi
  limits:
    cpu: 2
    memory: 1Gi
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: acme
    kubernetes.io/ingress.class: nginx
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  hosts:
    - "cloud.myhost.com"
  tls:
    - hosts:
        - "cloud.myhost.com"
      secretName: prod-cert
  path: /
nextcloud:
  username: admin
  password: admin1
  # datadir: /mnt/data
  host: "cloud.myhost.com"
internalDatabase:
  enabled: true
externalDatabase:
  enabled: false
persistence:
  enabled: true
  # accessMode: ReadWriteMany
  # storageClass: nfs-client   # if creating via the provisioner
  existingClaim: nextcloud-data   # comment out if creating a new PVC via the provisioner
I will add as well that my example PV above includes:
mountOptions:
  - async
  - nfsvers=4.2
  - noatime
These do not appear to affect (or improve) the NFS performance at all in this case. Based on the other deployments I have utilizing NFS, this seems odd.
Hello there,
I've got the same issue: an NFS PVC works well with Nextcloud v17, but like you, @WojoInc, with Nextcloud v19 I'm stuck at "Initializing Nextcloud...".
Even though the installation seems to fail and the pod loops on restarts, my NFS volume does appear to have been written with Nextcloud v19 data. I'm now trying to get more verbosity about that.
Have a nice time :)
Hi,
I faced the same problem. I logged in to the physical node and watched the Docker logs. There I saw that Nextcloud tried to connect via HTTP to the defined host. I have HAProxy (OPNsense) in front of Kubernetes and redirect all HTTP to HTTPS, and this was the issue. For the init process of Nextcloud I temporarily added an HTTP rule for it, and the process completed without problems.
Maybe you have a similar setup?
BR Scizoo
Hello @Scizoo88,
Thanks for sharing your experience. I don't think I have that setup, because my Nextcloud 19 pod, without an NFS PVC for now, is accessible via both HTTP and HTTPS.
In my case, the only difference between a working and a non-working setup is that I've enabled data persistence (when using Nextcloud v19). Persistence worked fine on Nextcloud 17 with the same Kubernetes network setup, though.
Have a nice day,
OK, I've managed to connect with an external DB; Nextcloud 19 seems to install and function pretty well with the PVC enabled. Maybe this error is SQLite-related.
Hi guys, I already checked this. We're using a fixed fsGroup for the apache and the nginx containers. Because Nextcloud copies files around via rsync on startup, it relies on valid permissions on the volumes.
But in my case the user ID and groups on my NFS client mount are different; my logs show permission denied errors.
I see two possible solutions:
- add a sidecar, or the possibility of generic sidecar containers, to do something like chown -R ...
- try to use securityContext.fsGroupChangePolicy = Always (Kubernetes 1.18 alpha); a sketch follows below
For the moment I would tend to go for the sidecar option, so that you can handle volume permissions yourselves.
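As an illustration of the second option, a minimal sketch at the pod-spec level; whether the chart exposes these fields directly depends on the chart version, and the group ID is an assumption based on the www-data user in the official image:
securityContext:
  fsGroup: 33                     # assumed GID of www-data in the official nextcloud image
  fsGroupChangePolicy: "Always"   # alpha in Kubernetes 1.18; re-applies group ownership on every mount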
Best
I seem to run into errors with permissions even when the NFS mount is owned by www-data. I have tried manually editing the securityContext to set the fsGroupChangePolicy, and this didn't seem to resolve the issue either. I'll dive in a bit more and test whether a sidecar or init container could set the permissions correctly.
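For the init container route, something along these lines is what I have in mind; the image, the IDs, and the volume name are placeholders and untested, and how it gets wired into the chart (an extra-init-containers value or a patched Deployment) is an open question:
initContainers:
  - name: fix-volume-ownership
    image: busybox
    # 33:0 matches the www-data:root ownership that the image's rsync applies
    command: ["sh", "-c", "chown -R 33:0 /var/www/html"]
    securityContext:
      runAsUser: 0
    volumeMounts:
      - name: nextcloud-data   # placeholder volume name
        mountPath: /var/www/html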
I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the PersistentVolume), at least for the initial install. This took the time to initialize Nextcloud down from >15 minutes to just under 10 seconds.
I plan to test whether the permissions are still an issue now.
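To be clear, async here is a server-side setting in /etc/exports rather than a PV mountOption; a rough sketch of what such an export line could look like (path and client range are placeholders, not my actual config):
<path> x.x.x.x/24(rw,async,no_subtree_check)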
I'm experiencing the same problem, I tried to change the securityContext params but that didn't solve the problem...
I think I'm having the same issue:
- the container is being periodically restarted
- the only output to the log is "Initializing nextcloud 19.0.3.1 ..."
- the PVC is automatically created from my NFS storage class
I'll try adding the async option to the host and PV, then report back.
Edit: having trouble adding async to my NFS server because of the storage class provider I'm using.
@WojoInc Could you explain how you changed the NFS export options?
Also looking for guidance here; I'm seeing a permission issue that I'm not sure has an easy solve, as I'm also using an nfs-provisioner:
kubectl logs nextcloud-7969756654-7j9xh --tail 50 -f
Initializing nextcloud 19.0.4.2 ...
Upgrading nextcloud from 17.0.0.9 ...
Initializing finished
Console has to be executed with the user that owns the file config/config.php
Current user: www-data
Owner of config.php: root
Try adding 'sudo -u root ' to the beginning of the command (without the single quotes)
If running with 'docker exec' try adding the option '-u root' to the docker command (without the single quotes)
I would go change the default permissions of the NFS share, but all other pods using NFS would then run into issues. Previously you discussed options to change the storage owner via a sidecar or fsGroupChangePolicy. Can you please expand on how this is accomplished?
I have the same issue, and the container does not contain any log file. Any workaround for this?
EDIT: the issue appears to come from the livenessProbe delay being too low, so the initialization does not have time to finish. Disabling both livenessProbe and readinessProbe worked for me (Nextcloud 19-apache):
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize Nextcloud down from >15 mins to just under 10 seconds. I plan to test whether the permissions are still an issue now.
@WojoInc Are you using the nextcloud helm chart with replication set to e.g. 3?
I'm using the following configuration on the helm chart using terraform to set up the release:
resource "kubernetes_namespace" "ns_files" {
metadata {
name = "files"
}
}
resource "helm_release" "rel_files_cloud" {
repository = "https://nextcloud.github.io/helm/"
name="cloudfiles"
chart = "nextcloud"
namespace="files"
values = [
<<YAML
ingress:
enabled: true
annotations:
kubernetes.io/ingress.class: traefik
cert-manager.io/cluster-issuer: cluster-issuer
traefik.ingress.kubernetes.io/redirect-entry-point: https
traefik.frontend.passHostHeader: "true"
tls:
- hosts:
- files.haus.net
secretName: nextcloud-app-tls
YAML
]
set {
name = "nextcloud.host"
value = "files.haus.net"
}
set {
name = "nextcloud.username"
value = "vault:secret/data/nextcloud/app/credentials#app_user"
}
set {
name = "nextcloud.password"
value = "vault:secret/data/nextcloud/app/credentials#app_password"
}
set {
name = "mariadb.enabled"
value = "true"
}
set {
name = "mariadb.db.password"
value = "vault:secret/data/nextcloud/db/credentials#db_password"
}
set {
name = "mariadb.db.user"
value = "vault:secret/data/nextcloud/db/credentials#db_user"
}
set {
name = "mariadb.master.persistence.storageClass"
value = "nfs-client"
}
set {
name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
value = "https://vault.vault-system:8200"
}
set {
name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
value = "vault-cert-tls"
}
set {
name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-role"
value = "default"
}
set {
name = "persistence.enabled"
value = "true"
}
set {
name = "persistence.storageClass"
value = "nfs-client"
}
set {
name = "persistence.size"
value = "2.5Ti"
}
set {
name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
value = "https://vault.vault-system:8200"
}
set {
name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
value = "vault-cert-tls"
}
set {
name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-role"
value = "default"
}
}
I end up with the following log for nextcloud:
time="2020-12-15T23:02:34Z" level=info msg="received new Vault token" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="initial Vault token arrived" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="spawning process: [/entrypoint.sh apache2-foreground]" app=vault-env
Initializing nextcloud 19.0.3.1 ...
I check the nfs-client-provisioner and notice that the folders have the following permissions:
/mnt/external/files-cloudfiles-nextcloud-nextcloud-pvc-646eb797-7470-4dd3-94cc-590b9ca5a074# ll
total 36
drwxrwxrwx 9 root root 4096 Dec 15 22:47 ./
drwxr-xr-x 13 root root 4096 Dec 15 23:07 ../
drwxrwxrwx 2 root root 4096 Dec 15 22:47 config/
drwxrwxrwx 2 root root 4096 Dec 15 22:47 custom_apps/
drwxrwxrwx 2 root root 4096 Dec 15 22:47 data/
drwxrwxrwx 8 www-data root 4096 Dec 15 23:02 html/
drwxrwxrwx 4 root root 4096 Dec 15 22:47 root/
drwxrwxrwx 2 root root 4096 Dec 15 22:47 themes/
drwxrwxrwx 2 root root 4096 Dec 15 22:47 tmp/
My /etc/exports has the following configuration:
/mnt/external 192.168.0.120/32(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 172.16.0.0/29(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 10.42.0.0/16(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)
I'm not using the Helm chart; I've just manually created a Deployment for NC with an nfs-client-provisioner volume, but I experience the same issue. In my case, I moved a previous NC install to k8s, so my log output consists of the initializing line and then an upgrading line, and then it is stuck forever. Exec'ing into the pod and running top, it seems an rsync command is running forever.
What's most disturbing is that the S and D statuses mean sleep and uninterruptible sleep, so it seems all the syncs are not doing anything. I also tried setting fsGroup to 33, but nothing changes, and the existing files have the right permissions from the previous non-k8s install, I think.
root@nextcloud-55c6cb7cbd-d9cmv:/var/www/html# ps aux --width 200
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2388 1444 ? Ss 18:45 0:00 /bin/sh /entrypoint.sh /usr/bin/supervisord -c /supervisord.conf
root 32 0.0 0.1 114460 12568 ? S 18:45 0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/
root 33 0.0 0.1 126596 8372 ? S 18:45 0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/
root 34 0.2 0.0 114620 3796 ? D 18:45 0:01 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/
root 63 0.0 0.0 4000 3076 pts/0 Ss 18:53 0:00 bash
root 72 0.0 0.0 7640 2664 pts/0 R+ 18:54 0:00 ps aux --width 200
I am having the same issue with v20.0.4.
I was using NFSv4.2 in my previous try. When I pinned the version to NFSv3, it went further, but then also got stuck, this time on an occ PHP command. The symptoms are the same: deep sleep of the PHP thread. My NFS server is a privileged CentOS Stream LXC container in Proxmox with the NFS and FUSE features enabled, backed by a bind mount from the host. When NC was stuck on both NFSv3 and v4, I saw a kernel panic with NFS messages in it on the host, and I couldn't use NFS reliably after that; only restarting the whole host stabilized it again. This is both absurd and funny at once: starting NC in k8s collapses the whole hypervisor, lol 😃
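For context, the NFS version pin goes into the volume's mount options, roughly like the sketch below; whether it belongs on a manual PV or in the provisioner's storage class depends on the setup:
mountOptions:
  - nfsvers=3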
Can you also replicate huge load average numbers when running Nextcloud with NFS in a k8s cluster for at least one week? I also have to restart the node, because the load average hits 100 for some unknown reason due to Nextcloud.
It has never started up far enough to reach the web interface; it just gets stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).
CPU usage is not the only component to check. Look at the load average on htop while it's doing its thing.
It really didn't do anything, all threads were sleeping in top (S + D flags).
I have had the same error and was able to resolve it by fixing /etc/exports. I was also using the nfs-provisioner.
My previous /etc/exports file was
/mnt/nfsdir -async,no_subtree_check *(rw,insecure,sync,no_subtree_check,no_root_squash)
I changed it to the rancher /etc/exports example and I was able to deploy nextcloud successfully.
/mnt/nfsdir *(rw,sync,no_subtree_check,no_root_squash)
I've been having this issue as well. I think it's caused by three things:
- the rsyncs are done on folders with huge amounts of files
- every file causes quite some IO operations (lstat, open, write, close)
- every IO operation needs to go "over the wire" before rsync continues, as the volume is backed by NFS
When I look at my nfsd stats (grafana/prometheus/node-exporter), there is a lot (+/- 50% of the IOPS) of GetAttr (caused by lstat syscalls) going on during the rsync. When using block-based volumes, these are served from the local cache, which is orders of magnitude quicker.
Sure, async,noatime will improve things, and maybe even throw in NFSv3, but in the end you're rsyncing a truckload of files onto an NFS share, and that's not very efficient.
I'd suggest enabling the startupProbe and tweaking its periodSeconds and failureThreshold. This is probably better than tweaking or disabling the readiness/liveness probes.
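A rough sketch of what that could look like in the chart values, assuming the startup probe is configurable the same way as the liveness/readiness probes (the numbers are examples, not recommendations):
startupProbe:
  enabled: true
  periodSeconds: 30
  failureThreshold: 60   # tolerates roughly 30 minutes of initial rsync before the pod is restarted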
Same issue with a Kadalu backend. I set the initial delay to a day; let's see what happens...
-- edit: it took two hours to initialize.
Is there any solution for this issue?
I tried every suggestion with no success :(
Don't believe anyone when they tell you NFS or CIFS works with file locking. Inevitably you will experience data corruption. I recommend a solution such as Longhorn or similar in a Kubernetes environment. It will use local storage on each worker node and iSCSI behind the scenes as needed to create your PVCs.
We all start out using NFS in the Linux world, but it just doesn't support full file locking. iSCSI takes some time to learn; you might be better off using something like Longhorn and letting it do the iSCSI for you. Seriously, abandon NFS, don't waste any more of your life trying to get it to work.
I can't even begin to tell you how fast and flawlessly everything works with iSCSI, and how nice it is to have the slowness and inevitable bizarre failures of NFS behind me. Make the change. Do it, do it now. (Or just buy a network storage device which uses iSCSI.)
https://forums.plex.tv/t/roadmap-to-allow-network-share-for-configuration-data/761162
** Update: I wanted to note that I've learned that with the right NFS-specific hardware, NFS can perform as quickly as iSCSI. Also, VMware adds some sort of protection to its NFS shares, so those actually do support full file locking. And Longhorn isn't perfect: it uses NFS for its RWX volumes (sigh), but RWO with Longhorn works. I think I'm going to switch to Rook/Ceph.
Locking is not the issue here; it's the fact that lstat is not served by a local FS or cache.
I think both NFS and block based solutions have their place, even in a Kubernetes context, and both come with their unique advantages and problems. In this (specific) case I totally agree with you: a block based solution will not have this problem.
It's a permission issue I think. The pod fails with:
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypealiases.dist.json.bYpaGG" failed: Operation not permitted (1)
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypemapping.dist.json.ChHk9F" failed: Operation not permitted (1)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
All files are synced, but because rsync can't do the chown it returns a non-zero exit code.
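For what it's worth, those chown failures are the classic symptom of an export that squashes root: the entrypoint's rsync runs as root and tries to chown files to www-data, which the NFS server then rejects. A hedged example of an export that avoids this (path and client range are placeholders):
<path> x.x.x.x/24(rw,sync,no_subtree_check,no_root_squash)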