
Stuck at "Initializing Nextcloud..." when attached to NFS PVC

Open somerandow opened this issue 4 years ago • 59 comments

Doing my best to dupe helm/charts#22920 over to the new repo as I am experiencing this issue as well. I have refined the details a bit, as this issue appears to be specifically related to NFS-based storage.

Describe the bug

When bringing up the nextcloud pod via the helm chart, the logs show the pod as being stuck at:

2020-08-31T19:00:42.054297154Z Configuring Redis as session handler
2020-08-31T19:00:42.098305129Z Initializing nextcloud 19.0.1.1 ...

Even backing the liveness/readiness probes out to over 5 minutes does not give the install enough time to finish. If I instead switch the PVC to my storageClass for Rancher Longhorn (iSCSI), for example, the Nextcloud install initializes in seconds.

Version of Helm and Kubernetes:

helm: v3.3.0
kubernetes: v1.18.6

Which chart:

nextcloud/helm

What happened:

  • Namespace is created
  • Helm creates the NFS PVC, or it is created manually
  • Helm instantiates the Nextcloud pod
  • The Nextcloud pod attaches the PVC and starts
  • The Nextcloud container is stuck at the above line

What you expected to happen:

  • Nextcloud finishes initialization
  • Nextcloud files appear with the correct permissions on the NFS volume

How to reproduce it (as minimally and precisely as possible):

Set up an NFS provisioner:

helm install nfs stable/nfs-client-provisioner \
  --set nfs.server=x.x.x.x --set nfs.path=<path>

OR Configure an NFS PV and PVC manually

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-data
  labels:
    app: cloud
    type: data
spec:
  capacity:
    storage: 100Ti
  nfs:
    path: <path>
    server: <server>
  mountOptions:
    - async
    - nfsvers=4.2
    - noatime
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-manual
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nextcloud-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: nfs-manual
  volumeMode: Filesystem
  selector:
    matchLabels:
      app: cloud
      type: data

Install Nextcloud:

helm install nextcloud nextcloud/helm -f values.yaml --namespace=nextcloud

values.yaml:

image:
  repository: nextcloud
  tag: 19
readinessProbe:
  initialDelaySeconds: 560
livenessProbe:
  initialDelaySeconds: 560
resources:
  requests:
    cpu: 200m
    memory: 500Mi
  limits:
    cpu: 2
    memory: 1Gi
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: acme
    kubernetes.io/ingress.class: nginx
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  hosts:
    - "cloud.myhost.com"
  tls:
    - hosts:
        - "cloud.myhost.com"
      secretName: prod-cert
  path: /
nextcloud:
  username: admin
  password: admin1
  # datadir: /mnt/data
  host: "cloud.myhost.com"
internalDatabase:
  enabled: true
externalDatabase:
  enabled: false
persistence:
  enabled: true
  # accessMode: ReadWriteMany
  # storageClass: nfs-client if creating via provisioner
  existingClaim: nextcloud-data # comment out if creating new PVC via provisioner

somerandow avatar Aug 31 '20 19:08 somerandow

I will add as well that my example PV above includes:

  mountOptions:
    - async
    - nfsvers=4.2
    - noatime

These do not appear to affect (or improve) the NFS performance at all in this case. Based on the other deployments I have utilizing NFS, this seems odd.

somerandow avatar Aug 31 '20 19:08 somerandow

Hello there,

I've got the same issue: the NFS PVC works well with Nextcloud v17. But, like you @WojoInc, with Nextcloud v19 I'm stuck at "Initializing Nextcloud...".

Even though the installation seems to fail and the pod loops on restart, my NFS volume does appear to be written with Nextcloud v19 data. I'm trying now to get more verbosity about that.

Have a nice time :)

thunerbl avatar Sep 01 '20 15:09 thunerbl

Hi,

I faced the same problem. I logged in to the physical node and watched the Docker logs. There I saw that Nextcloud tried to connect via HTTP to the defined host. I have HAProxy (OPNsense) in front of Kubernetes and redirect all HTTP to HTTPS, and this was the issue. For the init process of Nextcloud I temporarily added an HTTP rule for it, and the process completed without problems.

Maybe you have a similar setup?

BR Scizoo

Scizoo88 avatar Sep 02 '20 21:09 Scizoo88

Hello @Scizoo88,

Thanks for sharing your experience. I don't think I have that setup, because my Nextcloud 19 pod, without the NFS PVC for now, is accessible both via HTTP and HTTPS.

In my case, the only difference between a working and a non-working setup is that I've enabled data persistence (if I choose Nextcloud v19). Persistence worked great on Nextcloud 17, with the same Kubernetes network setup though.

Have a nice day,

thunerbl avatar Sep 04 '20 06:09 thunerbl

OK, I've managed to connect with an external DB; Nextcloud 19 seems to be installed and working pretty well, with the PVC enabled. Maybe this error is SQLite-related.

thunerbl avatar Sep 16 '20 11:09 thunerbl

Hi guys, I already checked this. We're using a fixed fsGroup for the apache and the nginx containers. Because Nextcloud copies files around via rsync on startup, it relies on valid permissions on the volumes.

But in my case the user IDs and groups on my NFS client mount are different. My logs show permission denied errors.

I see two possible solutions:

  • add a sidecar, or the possibility of generic sidecar containers, to do something like chown -R ...
  • try to use securityContext.fsGroupChangePolicy = Always (Kubernetes 1.18 alpha)

For the moment I would tend to go for sidecar possibility so that you guys can handle volume permissions by yourself.
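For reference, the second option would look something like this at the pod level (a sketch only; the exact values key depends on the chart version, and fsGroup 33 assumes the www-data UID/GID of the official nextcloud image). Note that kubelet cannot apply fsGroup-based ownership to NFS volumes, which may be why this route doesn't help everyone:

```yaml
# Sketch: pod-level securityContext asking kubelet to fix group ownership
# of mounted volumes. fsGroupChangePolicy is alpha in Kubernetes 1.18.
securityContext:
  fsGroup: 33                    # www-data in the official nextcloud image
  fsGroupChangePolicy: "Always"  # recursively chgrp/chmod on every mount
```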

Best

chrisingenhaag avatar Sep 16 '20 11:09 chrisingenhaag

I seem to run into errors with permissions even when the NFS mount is owned by www-data. I have tried manually editing the securityContext to set the fsGroupChangePolicy, and this didn't seem to resolve the issue either. I'll dive in a bit more and test whether a sidecar or init container could set the permissions correctly.

somerandow avatar Oct 05 '20 20:10 somerandow

I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.

I plan to test whether the permissions are still an issue now.
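For anyone trying to reproduce this, the key point is that async has to be set on the server-side export, not just in the PV's mountOptions. A sketch of an /etc/exports line (the path and client subnet here are examples):

```
# /etc/exports — async lets the server acknowledge writes before they hit disk
/srv/nfs/nextcloud 10.0.0.0/16(rw,async,no_subtree_check,no_root_squash)
```

After editing, re-export with exportfs -ra. Keep in mind async trades durability for speed: a server crash can lose acknowledged writes.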

somerandow avatar Oct 08 '20 18:10 somerandow

I'm experiencing the same problem, I tried to change the securityContext params but that didn't solve the problem...

J3m5 avatar Oct 08 '20 18:10 J3m5

I think I'm having the same issue:

  1. the container is being periodically restarted
  2. the only output to the log is "Initializing nextcloud 19.0.3.1 ..."
  3. the PVC is automatically created from my NFS storage class

I'll try adding the async option to the host and PV, then report back.

Edit: having trouble adding async to my NFS server because of the storage class provider I'm using.

davad avatar Oct 20 '20 05:10 davad

@WojoInc Could you explain how you changed the NFS export options?

unixfox avatar Nov 04 '20 20:11 unixfox

Also looking for guidance here; I'm seeing a permission issue that I'm not sure is an easy solve, as I'm also using an nfs-provisioner.

kubectl logs nextcloud-7969756654-7j9xh --tail 50 -f
Initializing nextcloud 19.0.4.2 ...
Upgrading nextcloud from 17.0.0.9 ...
Initializing finished
Console has to be executed with the user that owns the file config/config.php
Current user: www-data
Owner of config.php: root
Try adding 'sudo -u root ' to the beginning of the command (without the single quotes)
If running with 'docker exec' try adding the option '-u root' to the docker command (without the single quotes)

I would go change the default permissions on NFS, but all other pods using NFS would then run into issues. Previously you discussed options to change the storage owner via a sidecar or fsGroupChangePolicy. Can you please expand on how this is accomplished?
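One way to do the chown without touching the NFS export defaults for other pods is an init container scoped to just this deployment. A sketch (image, names, and mount path are assumptions; 33:33 is www-data in the official nextcloud image, and the export must allow root, e.g. no_root_squash, for the chown to succeed):

```yaml
# Hypothetical init container that fixes ownership before Nextcloud starts
initContainers:
  - name: fix-nfs-permissions
    image: busybox:1.32
    command: ["sh", "-c", "chown -R 33:33 /var/www/html"]
    securityContext:
      runAsUser: 0               # chown requires root on the mount
    volumeMounts:
      - name: nextcloud-data     # must match the pod's volume name
        mountPath: /var/www/html
```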

sOblivionsCall avatar Nov 06 '20 17:11 sOblivionsCall

I have the same issue, and the container does not contain any log file. Any workaround for this?

EDIT: the issue appears to come from the livenessProbe delay being too low; the initialization does not have time to finish. Disabling both livenessProbe and readinessProbe worked for me (Nextcloud 19-apache):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false

sundowndev avatar Nov 09 '20 18:11 sundowndev

> I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.
>
> I plan to test whether the permissions are still an issue now.

@WojoInc Are you using the nextcloud helm chart with replication set to e.g. 3?

Janl1 avatar Dec 08 '20 11:12 Janl1

I'm using the following configuration on the helm chart using terraform to set up the release:

resource "kubernetes_namespace" "ns_files" {
  metadata {
    name = "files"
  }
}

resource "helm_release" "rel_files_cloud" {
  repository = "https://nextcloud.github.io/helm/"
  name="cloudfiles"
  chart = "nextcloud"
  namespace="files"

  values = [
      <<YAML
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: traefik
            cert-manager.io/cluster-issuer: cluster-issuer
            traefik.ingress.kubernetes.io/redirect-entry-point: https
            traefik.frontend.passHostHeader: "true"
          tls:
            - hosts:
              - files.haus.net
              secretName: nextcloud-app-tls
      YAML
   ]

  set {
    name = "nextcloud.host"
    value = "files.haus.net"
  }

  set {
      name = "nextcloud.username"
      value = "vault:secret/data/nextcloud/app/credentials#app_user"
  }
  set {
      name = "nextcloud.password"
      value = "vault:secret/data/nextcloud/app/credentials#app_password"
  }
  set {
      name = "mariadb.enabled"
      value = "true"
  }
  set {
      name = "mariadb.db.password"
      value = "vault:secret/data/nextcloud/db/credentials#db_password"
  }
  set {
      name = "mariadb.db.user"
      value = "vault:secret/data/nextcloud/db/credentials#db_user"
  }
  set {
      name = "mariadb.master.persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
  set {
      name = "persistence.enabled"
      value = "true"
  }
  set {
      name = "persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "persistence.size"
      value = "2.5Ti"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
}

I end up with the following log for nextcloud:

time="2020-12-15T23:02:34Z" level=info msg="received new Vault token" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="initial Vault token arrived" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="spawning process: [/entrypoint.sh apache2-foreground]" app=vault-env
Initializing nextcloud 19.0.3.1 ...

I check the nfs-client-provisioner and notice that the folders have the following permissions:

/mnt/external/files-cloudfiles-nextcloud-nextcloud-pvc-646eb797-7470-4dd3-94cc-590b9ca5a074# ll
total 36
drwxrwxrwx  9 root     root 4096 Dec 15 22:47 ./
drwxr-xr-x 13 root     root 4096 Dec 15 23:07 ../
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 config/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 custom_apps/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 data/
drwxrwxrwx  8 www-data root 4096 Dec 15 23:02 html/
drwxrwxrwx  4 root     root 4096 Dec 15 22:47 root/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 themes/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 tmp/

My /etc/exports has the following configuration

/mnt/external 192.168.0.120/32(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 172.16.0.0/29(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 10.42.0.0/16(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)

mikeyGlitz avatar Dec 15 '20 23:12 mikeyGlitz

I'm not using the Helm chart; I've just manually created a Deployment for NC with an nfs-client-provisioner volume, but I experience the same issue. In my case, I moved a previous NC install to k8s, so my log output consists of the initializing line, then an upgrading line, then it's stuck forever. Execing into the pod and running top, it seems an rsync command is running forever.

immanuelfodor avatar Jan 02 '21 18:01 immanuelfodor

What's most disturbing is that the S and D statuses mean sleep and uninterruptible sleep, so it seems all the syncs are not doing anything. Also tried setting fsGroup to 33 but nothing changes, and the existing files are at the right permission from the previous non-k8s install I think.

root@nextcloud-55c6cb7cbd-d9cmv:/var/www/html# ps aux --width 200              
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND     
root           1  0.0  0.0   2388  1444 ?        Ss   18:45   0:00 /bin/sh /entrypoint.sh /usr/bin/supervisord -c /supervisord.conf                           
root          32  0.0  0.1 114460 12568 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          33  0.0  0.1 126596  8372 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          34  0.2  0.0 114620  3796 ?        D    18:45   0:01 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          63  0.0  0.0   4000  3076 pts/0    Ss   18:53   0:00 bash        
root          72  0.0  0.0   7640  2664 pts/0    R+   18:54   0:00 ps aux --width 200

immanuelfodor avatar Jan 02 '21 19:01 immanuelfodor

I am having the same issue with v20.0.4.

maxirus avatar Jan 17 '21 02:01 maxirus

I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

immanuelfodor avatar Jan 17 '21 04:01 immanuelfodor

> I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

Can you also replicate huge load average numbers when running Nextcloud with NFS in a k8s cluster for at least one week? I also have to restart the node because the load average hits 100 for some unknown reason due to Nextcloud.

unixfox avatar Jan 17 '21 08:01 unixfox

It never started up far enough to reach the web interface; it was stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

immanuelfodor avatar Jan 17 '21 09:01 immanuelfodor

> It never started up far enough to reach the web interface; it was stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

CPU usage is not the only component to check. Look at the load average on htop while it's doing its thing.

unixfox avatar Jan 17 '21 09:01 unixfox

It really didn't do anything, all threads were sleeping in top (S + D flags).

immanuelfodor avatar Jan 17 '21 10:01 immanuelfodor

I have had the same error and was able to resolve it by fixing /etc/exports. I was also using the nfs-provisioner.

My previous /etc/exports file was

/mnt/nfsdir -async,no_subtree_check *(rw,insecure,sync,no_subtree_check,no_root_squash)

I changed it to the rancher /etc/exports example and I was able to deploy nextcloud successfully.

/mnt/nfsdir    *(rw,sync,no_subtree_check,no_root_squash)

dentropy avatar Feb 23 '21 19:02 dentropy

I've been having this issue as well. I think it's caused by three things:

  • the rsyncs are done on folders with huge amounts of files
  • every file causes quite some IO-operations (lstat, open, write, close)
  • every IO-operation needs to go "over the wire" before rsync continues, as the volume is backed by NFS

When I look at my nfsd stats (grafana/prometheus/node-exporter), there is a lot (+/- 50% of the IOPS) of GetAttr (caused by lstat syscalls) going on during the rsync. When using block-based volumes, these are served from local cache, which is magnitudes quicker.

Sure, async and noatime will improve things, and maybe even NFSv3 will help, but in the end you're rsyncing a truckload of files onto an NFS share, and that's not very efficient.

I'd suggest to enable the startupProbe, and tweak the periodSeconds and failureThreshold. This is probably better than tweaking/disabling the readiness/liveness probes.
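In chart values that could look something like this (keys follow the chart's probe settings; the numbers are illustrative, not recommendations):

```yaml
startupProbe:
  enabled: true
  periodSeconds: 30      # probe every 30s while starting up
  failureThreshold: 60   # tolerate up to 30 minutes of initialization
livenessProbe:
  enabled: true          # only takes effect once the startupProbe succeeds
readinessProbe:
  enabled: true
```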

jonkerj avatar Mar 23 '21 13:03 jonkerj

Same issue with a Kadalu backend. I set the initial delay to a day; let's see what happens...

-- edit: it took two hours to initialize.

danielvandenberg95 avatar Oct 30 '21 13:10 danielvandenberg95

Is there any solution for this issue?

I tried every suggestion with no success :(

dcardellino avatar Nov 22 '21 06:11 dcardellino

Don't believe anyone when they tell you NFS or CIFS works with file locking. Inevitably you will experience data corruption. I recommend a solution such as Longhorn or similar in a Kubernetes environment. It will use local storage on each worker node and iSCSI behind the scenes as needed to create your PVC.

We all start out using NFS in the Linux world, but it just doesn't support full file locking. iSCSI takes some time to learn, so you might be better off using something like Longhorn and letting it do the iSCSI for you. Seriously, abandon NFS; don't waste any more of your life trying to get it to work.

I can't even begin to tell you how fast and flawlessly everything works with iSCSI, and how nice it is to have the slowness and inevitable bizarre failures of NFS behind me. Make the change. Do it, do it now. (Or just buy a network storage device which uses iSCSI.)

https://forums.plex.tv/t/roadmap-to-allow-network-share-for-configuration-data/761162

** Update: I wanted to note that I've learned NFS can perform as quickly as iSCSI if you have the right NFS-specific hardware. Also, VMware adds some sort of protection to its NFS shares, so those actually do support full file locking. And Longhorn isn't perfect either: it uses NFS for its RWX volumes (sigh), though RWO with Longhorn works. I think I'm going to switch to Rook/Ceph.

lknite avatar Dec 25 '21 00:12 lknite

Locking is not the issue here, it's the fact that lstat is not served by a local FS or cache.

I think both NFS and block based solutions have their place, even in a Kubernetes context, and both come with their unique advantages and problems. In this (specific) case I totally agree with you: a block based solution will not have this problem.

jonkerj avatar Dec 25 '21 12:12 jonkerj

It's a permission issue I think. The pod fails with:

rsync: [receiver] chown "/var/www/html/resources/config/.mimetypealiases.dist.json.bYpaGG" failed: Operation not permitted (1)
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypemapping.dist.json.ChHk9F" failed: Operation not permitted (1)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]

All files are synced, but because rsync can't do the chown it returns a non-zero exit code.

devent avatar Jan 24 '22 15:01 devent