
Stuck at "Initializing Nextcloud..." when attached to NFS PVC

Open somerandow opened this issue 4 years ago • 59 comments

Doing my best to dupe helm/charts#22920 over to the new repo as I am experiencing this issue as well. I have refined the details a bit, as this issue appears to be specifically related to NFS-based storage.

Describe the bug

When bringing up the nextcloud pod via the helm chart, the logs show the pod as being stuck at:

2020-08-31T19:00:42.054297154Z Configuring Redis as session handler
2020-08-31T19:00:42.098305129Z Initializing nextcloud 19.0.1.1 ...

Even backing the liveness/readiness probes out to over 5 minutes does not give the install enough time to finish. If I instead switch the PVC to my storageClass for Rancher Longhorn (iSCSI), for example, the Nextcloud install initializes in seconds.

Version of Helm and Kubernetes:

helm: v3.3.0
kubernetes: v1.18.6

Which chart:

nextcloud/helm

What happened:

  • Namespace is created
  • Helm creates the NFS PVC, or it is created manually
  • Helm instantiates the Nextcloud pod
  • The Nextcloud pod attaches the PVC and starts
  • The Nextcloud container is stuck at the above line

What you expected to happen:

  • Nextcloud finishes initialization
  • Nextcloud files appear with the correct permissions on the NFS volume

How to reproduce it (as minimally and precisely as possible):

Set up an NFS provisioner:

helm install nfs stable/nfs-client-provisioner \
  --set nfs.server=x.x.x.x --set nfs.path=<path>

OR Configure an NFS PV and PVC manually

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-data
  labels:
    app: cloud
    type: data
spec:
  capacity:
    storage: 100Ti
  nfs:
    path: <path>
    server: <server>
  mountOptions:
    - async
    - nfsvers=4.2
    - noatime
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-manual
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nextcloud-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: nfs-manual
  volumeMode: Filesystem
  selector:
    matchLabels:
      app: cloud
      type: data

Install Nextcloud:

helm install nextcloud nextcloud/helm -f values.yaml --namespace=nextcloud

values.yaml:

image:
  repository: nextcloud
  tag: 19
readinessProbe:
  initialDelaySeconds: 560
livenessProbe:
  initialDelaySeconds: 560
resources:
  requests:
    cpu: 200m
    memory: 500Mi
  limits:
    cpu: 2
    memory: 1Gi
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: acme
    kubernetes.io/ingress.class: nginx
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  hosts:
    - "cloud.myhost.com"
  tls:
    - hosts:
        - "cloud.myhost.com"
      secretName: prod-cert
  path: /
nextcloud:
  username: admin
  password: admin1
  # datadir: /mnt/data
  host: "cloud.myhost.com"
internalDatabase:
  enabled: true
externalDatabase:
  enabled: false
persistence:
  enabled: true
  # accessMode: ReadWriteMany
  # storageClass: nfs-client if creating via provisioner
  existingClaim: nextcloud-data # comment out if creating new PVC via provisioner

somerandow avatar Aug 31 '20 19:08 somerandow

I will add as well that my example PV above includes:

  mountOptions:
    - async
    - nfsvers=4.2
    - noatime

These do not appear to affect (or improve) the NFS performance at all in this case. Based on the other deployments I have utilizing NFS, this seems odd.

somerandow avatar Aug 31 '20 19:08 somerandow

Hello there,

I've got the same issue: the NFS PVC works well with Nextcloud v17. But, like you @WojoInc, with Nextcloud v19 I'm stuck at "Initializing Nextcloud...".

Even though the installation seems to fail and the pod loops on restart, my NFS volume does appear to be written with Nextcloud v19 data. I'm trying now to get more verbosity about that.

Have a nice time :)

thunerbl avatar Sep 01 '20 15:09 thunerbl

Hi,

I faced the same problem. I logged in to the physical node and watched the Docker logs. There I saw that Nextcloud tried to connect via HTTP to the defined host. I have HAProxy (OPNsense) in front of Kubernetes and redirect all HTTP to HTTPS, and this was the issue. For the init process of Nextcloud I temporarily added an HTTP rule for it, and the process completed without problems.

Maybe you have a similar setup?

BR Scizoo

Scizoo88 avatar Sep 02 '20 21:09 Scizoo88

Hello @Scizoo88,

Thanks for sharing your experience. I don't think I have that setup, because my Nextcloud 19 pod, without the NFS PVC for now, is accessible both via HTTP and HTTPS.

In my case, the only difference between a working and a non-working setup is that I've enabled data persistence (if I choose Nextcloud v19). Persistence worked great on Nextcloud 17, with the same Kubernetes network setup though.

Have a nice day,

thunerbl avatar Sep 04 '20 06:09 thunerbl

OK, I've managed to connect with an external DB; Nextcloud 19 seems to be installed and working pretty well, with the PVC enabled. Maybe this error is SQLite-related.

thunerbl avatar Sep 16 '20 11:09 thunerbl

Hi guys, I already checked this. We're using a fixed fsGroup for the apache and the nginx containers. Because Nextcloud copies files around via rsync on startup, it relies on valid permissions on the volumes.

But in my case the user IDs and groups on my NFS client mount are different. My logs show permission denied errors.

I see two possible solutions:

  • add a sidecar, or the possibility of generic sidecar containers, to do something like chown -R ...
  • try to use securityContext.fsGroupChangePolicy = Always (Kubernetes 1.18 alpha)

For the moment I would tend to go for sidecar possibility so that you guys can handle volume permissions by yourself.
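For reference, the second option would look something like this at the pod level (a sketch only; the exact values key depends on the chart version, and fsGroup 33 assumes the www-data UID/GID of the official nextcloud image). Note that kubelet cannot apply fsGroup-based ownership to NFS volumes, which may be why this route doesn't help everyone:

```yaml
# Sketch: pod-level securityContext asking kubelet to fix group ownership
# of mounted volumes. fsGroupChangePolicy is alpha in Kubernetes 1.18.
securityContext:
  fsGroup: 33                    # www-data in the official nextcloud image
  fsGroupChangePolicy: "Always"  # recursively chgrp/chmod on every mount
```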

Best

chrisingenhaag avatar Sep 16 '20 11:09 chrisingenhaag

I seem to run into errors with permissions even when the NFS mount is owned by www-data. I have tried manually editing the securityContext to set the fsGroupChangePolicy, and this didn't seem to resolve the issue either. I'll dive in a bit more and test whether a sidecar or init container could set the permissions correctly.

somerandow avatar Oct 05 '20 20:10 somerandow

I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.

I plan to test whether the permissions are still an issue now.
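For anyone trying to reproduce this, the key point is that async has to be set on the server-side export, not just in the PV's mountOptions. A sketch of an /etc/exports line (the path and client subnet here are examples):

```
# /etc/exports — async lets the server acknowledge writes before they hit disk
/srv/nfs/nextcloud 10.0.0.0/16(rw,async,no_subtree_check,no_root_squash)
```

After editing, re-export with exportfs -ra. Keep in mind async trades durability for speed: a server crash can lose acknowledged writes.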

somerandow avatar Oct 08 '20 18:10 somerandow

I'm experiencing the same problem, I tried to change the securityContext params but that didn't solve the problem...

J3m5 avatar Oct 08 '20 18:10 J3m5

I think I'm having the same issue:

  1. the container is being periodically restarted
  2. the only output to the log is "Initializing nextcloud 19.0.3.1 ..."
  3. the PVC is automatically created from my NFS storage class

I'll try adding the async option to the host and PV, then report back.

Edit: having trouble adding async to my NFS server because of the storage class provider I'm using.

davad avatar Oct 20 '20 05:10 davad

@WojoInc Could you explain how you changed the NFS export options?

unixfox avatar Nov 04 '20 20:11 unixfox

Also looking for guidance here; I'm seeing a permission issue that I'm not sure is an easy solve, as I'm also using an nfs-provisioner.

kubectl logs nextcloud-7969756654-7j9xh --tail 50 -f
Initializing nextcloud 19.0.4.2 ...
Upgrading nextcloud from 17.0.0.9 ...
Initializing finished
Console has to be executed with the user that owns the file config/config.php
Current user: www-data
Owner of config.php: root
Try adding 'sudo -u root ' to the beginning of the command (without the single quotes)
If running with 'docker exec' try adding the option '-u root' to the docker command (without the single quotes)

I would go change the default permissions on NFS, but all other pods using NFS would then run into issues. Previously you discussed options to change the storage owner via a sidecar or fsGroupChangePolicy. Can you please expand on how this is accomplished?
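One way to do the chown without touching the NFS export defaults for other pods is an init container scoped to just this deployment. A sketch (image, names, and mount path are assumptions; 33:33 is www-data in the official nextcloud image, and the export must allow root, e.g. no_root_squash, for the chown to succeed):

```yaml
# Hypothetical init container that fixes ownership before Nextcloud starts
initContainers:
  - name: fix-nfs-permissions
    image: busybox:1.32
    command: ["sh", "-c", "chown -R 33:33 /var/www/html"]
    securityContext:
      runAsUser: 0               # chown requires root on the mount
    volumeMounts:
      - name: nextcloud-data     # must match the pod's volume name
        mountPath: /var/www/html
```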

sOblivionsCall avatar Nov 06 '20 17:11 sOblivionsCall

I have the same issue, and the container does not contain any log file. Any workaround for this?

EDIT: the issue appears to come from the livenessProbe delay being too low; the initialization does not have time to finish. Disabling both livenessProbe and readinessProbe worked for me (Nextcloud 19-apache):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false

sundowndev avatar Nov 09 '20 18:11 sundowndev

> I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.
>
> I plan to test whether the permissions are still an issue now.

@WojoInc Are you using the nextcloud helm chart with replication set to e.g. 3?

Janl1 avatar Dec 08 '20 11:12 Janl1

I'm using the following configuration on the helm chart using terraform to set up the release:

resource "kubernetes_namespace" "ns_files" {
  metadata {
    name = "files"
  }
}

resource "helm_release" "rel_files_cloud" {
  repository = "https://nextcloud.github.io/helm/"
  name="cloudfiles"
  chart = "nextcloud"
  namespace="files"

  values = [
      <<YAML
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: traefik
            cert-manager.io/cluster-issuer: cluster-issuer
            traefik.ingress.kubernetes.io/redirect-entry-point: https
            traefik.frontend.passHostHeader: "true"
          tls:
            - hosts:
              - files.haus.net
              secretName: nextcloud-app-tls
      YAML
   ]

  set {
    name = "nextcloud.host"
    value = "files.haus.net"
  }

  set {
      name = "nextcloud.username"
      value = "vault:secret/data/nextcloud/app/credentials#app_user"
  }
  set {
      name = "nextcloud.password"
      value = "vault:secret/data/nextcloud/app/credentials#app_password"
  }
  set {
      name = "mariadb.enabled"
      value = "true"
  }
  set {
      name = "mariadb.db.password"
      value = "vault:secret/data/nextcloud/db/credentials#db_password"
  }
  set {
      name = "mariadb.db.user"
      value = "vault:secret/data/nextcloud/db/credentials#db_user"
  }
  set {
      name = "mariadb.master.persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
  set {
      name = "persistence.enabled"
      value = "true"
  }
  set {
      name = "persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "persistence.size"
      value = "2.5Ti"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
}

I end up with the following log for nextcloud:

time="2020-12-15T23:02:34Z" level=info msg="received new Vault token" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="initial Vault token arrived" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="spawning process: [/entrypoint.sh apache2-foreground]" app=vault-env
Initializing nextcloud 19.0.3.1 ...

I check the nfs-client-provisioner and notice that the folders have the following permissions:

/mnt/external/files-cloudfiles-nextcloud-nextcloud-pvc-646eb797-7470-4dd3-94cc-590b9ca5a074# ll
total 36
drwxrwxrwx  9 root     root 4096 Dec 15 22:47 ./
drwxr-xr-x 13 root     root 4096 Dec 15 23:07 ../
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 config/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 custom_apps/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 data/
drwxrwxrwx  8 www-data root 4096 Dec 15 23:02 html/
drwxrwxrwx  4 root     root 4096 Dec 15 22:47 root/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 themes/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 tmp/

My /etc/exports has the following configuration

/mnt/external 192.168.0.120/32(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 172.16.0.0/29(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 10.42.0.0/16(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)

mikeyGlitz avatar Dec 15 '20 23:12 mikeyGlitz

I'm not using the Helm chart; I've just manually created a Deployment for NC with an nfs-client-provisioner volume, but I experience the same issue. In my case, I moved a previous NC install to k8s, so my log output consists of the initializing line, then an upgrading line, then it's stuck forever. Execing into the pod and running top, it seems an rsync command is running forever.

immanuelfodor avatar Jan 02 '21 18:01 immanuelfodor

What's most disturbing is that the S and D statuses mean sleep and uninterruptible sleep, so it seems all the syncs are not doing anything. Also tried setting fsGroup to 33 but nothing changes, and the existing files are at the right permission from the previous non-k8s install I think.

root@nextcloud-55c6cb7cbd-d9cmv:/var/www/html# ps aux --width 200              
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND     
root           1  0.0  0.0   2388  1444 ?        Ss   18:45   0:00 /bin/sh /entrypoint.sh /usr/bin/supervisord -c /supervisord.conf                           
root          32  0.0  0.1 114460 12568 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          33  0.0  0.1 126596  8372 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          34  0.2  0.0 114620  3796 ?        D    18:45   0:01 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          63  0.0  0.0   4000  3076 pts/0    Ss   18:53   0:00 bash        
root          72  0.0  0.0   7640  2664 pts/0    R+   18:54   0:00 ps aux --width 200

immanuelfodor avatar Jan 02 '21 19:01 immanuelfodor

I am having the same issue with v20.0.4.

maxirus avatar Jan 17 '21 02:01 maxirus

I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

immanuelfodor avatar Jan 17 '21 04:01 immanuelfodor

> I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

Can you also replicate huge load average numbers when running Nextcloud with NFS in a k8s cluster for at least one week? I also have to restart the node because the load average hits 100 for some unknown reason due to Nextcloud.

unixfox avatar Jan 17 '21 08:01 unixfox

It never started up far enough to reach the web interface; it was stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

immanuelfodor avatar Jan 17 '21 09:01 immanuelfodor

> It never started up far enough to reach the web interface; it was stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

CPU usage is not the only component to check. Look at the load average on htop while it's doing its thing.

unixfox avatar Jan 17 '21 09:01 unixfox

It really didn't do anything, all threads were sleeping in top (S + D flags).

immanuelfodor avatar Jan 17 '21 10:01 immanuelfodor

I have had the same error and was able to resolve it by fixing /etc/exports. I was also using the nfs-provisioner.

My previous /etc/exports file was

/mnt/nfsdir -async,no_subtree_check *(rw,insecure,sync,no_subtree_check,no_root_squash)

I changed it to the rancher /etc/exports example and I was able to deploy nextcloud successfully.

/mnt/nfsdir    *(rw,sync,no_subtree_check,no_root_squash)

dentropy avatar Feb 23 '21 19:02 dentropy

I've been having this issue as well. I think it's caused by three things:

  • the rsyncs are done on folders with huge amounts of files
  • every file causes quite some IO-operations (lstat, open, write, close)
  • every IO-operation needs to go "over the wire" before rsync continues, as the volume is backed by NFS

When I look at my nfsd stats (grafana/prometheus/node-exporter), there is a lot (+/- 50% of the IOPS) of GetAttr (caused by lstat syscalls) going on during the rsync. When using block-based volumes, these are served from local cache, which is magnitudes quicker.

Sure, async and noatime will improve things, and maybe even NFSv3 will help, but in the end you're rsyncing a truckload of files onto an NFS share, and that's not very efficient.

I'd suggest to enable the startupProbe, and tweak the periodSeconds and failureThreshold. This is probably better than tweaking/disabling the readiness/liveness probes.
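In chart values that could look something like this (keys follow the chart's probe settings; the numbers are illustrative, not recommendations):

```yaml
startupProbe:
  enabled: true
  periodSeconds: 30      # probe every 30s while starting up
  failureThreshold: 60   # tolerate up to 30 minutes of initialization
livenessProbe:
  enabled: true          # only takes effect once the startupProbe succeeds
readinessProbe:
  enabled: true
```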

jonkerj avatar Mar 23 '21 13:03 jonkerj

Same issue with a Kadalu backend. I set the initial delay to a day; let's see what happens...

-- edit: it took two hours to initialize.

danielvandenberg95 avatar Oct 30 '21 13:10 danielvandenberg95

Is there any solution for this issue?

I tried every suggestion with no success :(

dcardellino avatar Nov 22 '21 06:11 dcardellino

Don't believe anyone when they tell you NFS or CIFS works with file locking. Inevitably you will experience data corruption. I recommend a solution such as Longhorn or similar in a Kubernetes environment. It will use local storage on each worker node and iSCSI behind the scenes as needed to create your PVC.

We all start out using NFS in the Linux world, but it just doesn't support full file locking. iSCSI takes some time to learn, so you might be better off using something like Longhorn and letting it do the iSCSI for you. Seriously, abandon NFS; don't waste any more of your life trying to get it to work.

I can't even begin to tell you how fast and flawlessly everything works with iSCSI, and how nice it is to have the slowness and inevitable bizarre failures of NFS behind me. Make the change. Do it, do it now. (Or just buy a network storage device which uses iSCSI.)

https://forums.plex.tv/t/roadmap-to-allow-network-share-for-configuration-data/761162

** Update: I wanted to note that I've learned NFS can perform as quickly as iSCSI if you have the right NFS-specific hardware. Also, VMware adds some sort of protection to its NFS shares, so those actually do support full file locking. And Longhorn isn't perfect either: it uses NFS for its RWX volumes (sigh), though RWO with Longhorn works. I think I'm going to switch to Rook/Ceph.

lknite avatar Dec 25 '21 00:12 lknite

Locking is not the issue here, it's the fact that lstat is not served by a local FS or cache.

I think both NFS and block based solutions have their place, even in a Kubernetes context, and both come with their unique advantages and problems. In this (specific) case I totally agree with you: a block based solution will not have this problem.

jonkerj avatar Dec 25 '21 12:12 jonkerj

It's a permission issue I think. The pod fails with:

rsync: [receiver] chown "/var/www/html/resources/config/.mimetypealiases.dist.json.bYpaGG" failed: Operation not permitted (1)
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypemapping.dist.json.ChHk9F" failed: Operation not permitted (1)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]

All files are synced, but because rsync can't do the chown it returns a non-zero exit code.

devent avatar Jan 24 '22 15:01 devent