Nextcloud fails to initialize when using a claim with "accessMode: ReadWriteMany" for primary persistence.

Open ScionOfDesign opened this issue 2 years ago • 29 comments

Describe your Issue

When trying to enable persistence for Nextcloud, the container hangs when attempting to use my own existing PVCs.

Logs and Errors

The container hangs with the following log:

Configuring Redis as session handler
Initializing nextcloud 26.0.1.1 ...

Describe your Environment

  • Kubernetes distribution: rke2

  • Helm Version (or App that manages helm): v3.11.3

  • Helm Chart Version: 3.5.12

My persistence section:

  persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    existingClaim: nextcloud-html-data-claim

    nextcloudData:
      enabled: true
      existingClaim: nextcloud-user-data-claim

Additional context, if any

It works fine if I comment out the usage of existing claims.

ScionOfDesign avatar May 19 '23 00:05 ScionOfDesign

The same issue apparently occurs if I try to use my own storage classes and configuration.

persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    #existingClaim: nextcloud-html-data-claim
    storageClass: longhorn-nvme
    accessMode: ReadWriteMany
    size: 8Gi

    nextcloudData:
      enabled: true
      storageClass: longhorn-block
      accessMode: ReadWriteMany
      size: 10Gi
      #existingClaim: nextcloud-user-data-claim

ScionOfDesign avatar May 19 '23 00:05 ScionOfDesign

It seems that the issue is with the accessMode of the primary persistent volume. It cannot be ReadWriteMany. This works:

  persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    #existingClaim: nextcloud-html-data-claim
    storageClass: longhorn-block
    #accessMode: ReadWriteOnce
    #size: 8Gi

    nextcloudData:
      enabled: true
      storageClass: longhorn-block
      accessMode: ReadWriteMany
      #size: 10Gi
      #existingClaim: nextcloud-user-data-claim

ScionOfDesign avatar May 19 '23 00:05 ScionOfDesign

The issue seems to be related to: https://github.com/nextcloud/helm/issues/10 Disabling the probes worked. I will continue to investigate.
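
For reference, this is roughly what disabling the probes looks like in values.yaml (a sketch only; it assumes the liveness/readiness blocks take the same enabled toggle as the startupProbe shown further down in this thread):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false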

ScionOfDesign avatar May 19 '23 02:05 ScionOfDesign

Having the same issue using my own storage class as well. Disabling probes doesn't help, as the server seems to have trouble starting.

jgrossmac avatar May 19 '23 04:05 jgrossmac

Same issue for me too. The only difference disabling probes made for me is that the pod is now 'running', but the same issue persists - Initializing nextcloud 26.0.2.1 ...

boomam avatar Jun 16 '23 18:06 boomam

hmmm, for those having this issue, could you let me know if there are any Events listed when you do a:

# replace $NEXTCLOUD_POD with your actual pod name
kubectl describe pod $NEXTCLOUD_POD

Similarly, do the existing claims have any Events when you run a describe?

# replace $NEXTCLOUD_PVC with your actual pvc name
kubectl describe pvc $NEXTCLOUD_PVC

Also does the status show pending there for the PVC?
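
A quick way to see the phase for every claim at once is:

# shows the STATUS column (Bound/Pending) for all claims in all namespaces
kubectl get pvc -A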

@tvories or @provokateurin have you tried using ReadWriteMany PVCs with the nextcloud container before? I tried to use longhorn at one point, but couldn't get it working, and assumed it was because I misconfigured longhorn, so I gave up and went back to the local path with k3s 🤔 We'd need to have ReadWriteMany working in order to support multiple pod replicas across multiple nodes accessing the same PVC, but I'm unsure what's currently blocking that.

Anyone else in the community who has knowledge on this is also welcome to give input :)

jessebot avatar Jul 06 '23 08:07 jessebot

I have used this chart with longhorn in the past, but it was RWO, iirc.

provokateurin avatar Jul 06 '23 09:07 provokateurin

I am using ReadWriteMany on an NFS mount for my primary Nextcloud storage and have been for a very long time.

@ScionOfDesign can you paste your PVC values? My guess is that this is a Longhorn configuration issue.

I am using existingClaim for my pvc rather than having the chart create it.

# pvc.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-nfs-config
spec:
  storageClassName: nextcloud-nfs-config
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /mnt/fatguys/k8s/nextcloud
    server: ${SECRET_NAS1}
  mountOptions:
    - nfsvers=4.1
    - tcp
    - intr
    - hard
    - noatime
    - nodiratime
    - rsize=1048576
    - wsize=1048576
# values.yaml
    persistence:
      enabled: true
      accessMode: ReadWriteMany
      size: 1Mi
      existingClaim: nextcloud-nfs-config
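
The matching claim isn't shown above, but it would be something along these lines (a sketch only; the metadata.name has to match existingClaim and the storageClassName ties it to the PersistentVolume above):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-nfs-config
spec:
  storageClassName: nextcloud-nfs-config
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi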

tvories avatar Jul 06 '23 15:07 tvories

Facing the same issue with the newest version of the Helm chart. Nextcloud container is stuck at Initializing nextcloud 27.0.0.8 ... when using a Longhorn RWX volume (served via NFS).

lorenzo-w avatar Jul 18 '23 20:07 lorenzo-w

I'm using Longhorn now on k3s, having followed this guide, and although it's definitely slower than local path (due to Longhorn's abstraction layer), it seems to be working for me right now. I'm not using NFS though :/

Here's my current values.yaml. I'm using existing claims for both nextcloud files and postgres. Here's nextcloud's pvc.

One thing I keep seeing is that the users in this thread with the failure to initialize are using two PVCs for nextcloud by setting persistence.nextcloudData.enabled: true. Could anyone having the issue verify a couple of things for us?

  1. does this happen if you use longhorn without NFS?
  2. does this happen if you use only one PVC (i.e. persistence.nextcloudData.enabled: false)?

To clarify, this should work with two PVCs and I'm not suggesting we don't support that. I'm just trying to narrow down the exact issue. 🤔

edit: update links to point to specific commit in time

jessebot avatar Jul 21 '23 08:07 jessebot

@ScionOfDesign Your answer worked for me. Try adding the following to your values.yaml. If I remember correctly, this gives the livenessProbe and readinessProbe more time so they don't restart the container when it's taking a while to install. If you need longer, you can raise these values.

startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50

It'll still take a while to install. I think I saw somewhere that it took nearly two hours for some poor guy. For me, though, it usually takes 10-20 minutes.

christensenjairus avatar Jul 22 '23 15:07 christensenjairus

That's such a long install time though :o

jessebot avatar Jul 23 '23 10:07 jessebot

Any update on this issue? We're facing the same problem and configured the startupProbe to be extremely long, but as @christensenjairus wrote, updates take 10-20 minutes.

From my point of view it has something to do with the rsync that copies the files to /var/www/html, but I don't understand why it is so slow in the Nextcloud container during init. I have other containers using Longhorn RWX volumes without these performance problems.
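
A rough way to watch that rsync's progress while the pod is stuck initializing (a sketch; $NEXTCLOUD_POD is a placeholder, and /var/www/html is the chart's primary persistence mount) is to poll how much data has arrived on the volume:

# replace $NEXTCLOUD_POD with your actual pod name
while true; do
  kubectl exec $NEXTCLOUD_POD -- du -sh /var/www/html
  sleep 10
done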

How can we help to get this issue fixed?

mueller-tobias avatar Jan 22 '24 06:01 mueller-tobias

It is not a Longhorn-only issue. I switched from Longhorn to rook-ceph and saw similar issues. Last weekend, I wanted to upgrade from Nextcloud 27 to Nextcloud 28. The whole process took >10 minutes, so I just disabled the probes and re-enabled them later.

During this, I saw three rsync processes working. This is rather strange, as I can normally rsync several GiB in the same time when doing backups with the same storage backend.

pfaelzerchen avatar Jan 22 '24 18:01 pfaelzerchen

1. does this happen if you use longhorn without NFS?

When I was using Longhorn, this problem did not appear with RWO PVCs. It was an NFS-only problem. But as I stated: I didn't have this problem with rook-ceph, except during the major release upgrade from 27 to 28.

2. does this happen if you use only one PVC (i.e. `persistence.nextcloudData.enabled: false`)?

Yes. I am using only one PVC and it happened with Longhorn RWX, and for the release upgrade from 27 to 28 it happened on rook-ceph with just one PVC.

pfaelzerchen avatar Jan 22 '24 18:01 pfaelzerchen

Did anyone find the real cause of this? I mean, why does it take so long only with RWX? Or did anyone find any other (better) solution for this?

Is there an ongoing performance issue when using RWX, or is it only an initialization issue?

I have been waiting for more than 20 minutes after adding the startupProbe, but still nothing new is shown :/

MohammedNoureldin avatar Feb 18 '24 17:02 MohammedNoureldin

I tried setting the startupProbe like @christensenjairus and both pods were running after a few minutes. But the performance is very poor compared to before. After a few clicks, I got an Internal Server Error and it seems that the data is broken.

For me, I don't see a way to deploy Nextcloud in high availability right now. Or am I wrong about that?

My Config:
replicaCount: 2

startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50

persistence:
  enabled: true
  accessMode: ReadWriteMany
  size: 8Gi

  nextcloudData:
    enabled: true
    accessMode: ReadWriteMany
    size: 8Gi

Tim-herbie avatar Mar 24 '24 16:03 Tim-herbie

@Tim-herbie, is it still the case? I mean ignoring that after a few clicks you get internal error.

Using Longhorn with PVC RWX (NFS) takes 20-30 minutes to initialize Nextcloud. The performance is poor.

Has anyone figured out how to resolve it by fine-tuning some magic variables?

MohammedNoureldin avatar May 30 '24 23:05 MohammedNoureldin

@MohammedNoureldin

I didn't find a solution for using more than one replica. I'm using it right now for private purposes with only one.

Tim-herbie avatar May 31 '24 05:05 Tim-herbie

More than 1 replica causing issues is a different issue; if that's the case, please search the issues for one about that, or open a second issue. This issue is specifically about accessMode. Can anyone still working on this confirm whether disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.

jessebot avatar May 31 '24 06:05 jessebot

I think he's referring to the fact that when he wants to use more than 1 replica he needs an RWX volume, which isn't currently working well because the init/upgrade process takes so long.

mueller-tobias avatar May 31 '24 06:05 mueller-tobias

@mueller-tobias that makes sense; however, more than one replica is still a separate issue, as this issue implies that it would break even with one replica. Multiple replicas causing issues is a known issue, but I can't seem to find the last time it was brought up 🤔

jessebot avatar May 31 '24 07:05 jessebot

You are right, @jessebot, I was particularly talking about RWX on NFS. Even with 1 replica on RWX NFS, the whole initialization and performance are poor.

@mueller-tobias exactly, thank you.

The issue here is obvious: creating/copying files to the RWX volume takes too long. If you observe the volume during the initialization, you will see that on every page refresh the size increases by ~2MB. So you can imagine how long it will take to reach 2.5GB (the estimated final initialization size).

The cause of the issue is not clear to me, though. I am not sure if this is an issue in Nextcloud or in NFS itself; I mean, should the solution be implemented by Nextcloud, or should it come from adapting the NFS configuration? I saw people talk about turning off NFS sync, with a small risk of losing some data. Losing data is not acceptable, which is why I am still looking for a safer solution.
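
For context, the "turning off NFS sync" people mention is an export option on the NFS server side, not something Nextcloud controls. On a self-managed NFS server (like the setup tvories described, not Longhorn's built-in RWX), it would look roughly like this (illustrative path and client range; async improves small-file write latency at the cost of a small data-loss window if the server crashes):

# /etc/exports on the NFS server (illustrative)
/srv/nextcloud 10.0.0.0/24(rw,async,no_subtree_check)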

MohammedNoureldin avatar May 31 '24 07:05 MohammedNoureldin

Ok, we're on the same page then. To be sure, did you try the suggestions in https://github.com/nextcloud/helm/issues/399#issuecomment-1623875028 ? If those don't work, maybe tagging tvories to troubleshoot would be helpful.

Sorry for not being more helpful. I don't run NFS personally, but I've added an NFS label, as NFS comes up frequently enough in the issues and I'm going to start grouping them all together for easier searching as I come across them.

jessebot avatar May 31 '24 08:05 jessebot

Hi, @tvories, may I ask for your support?

I am trying to improve the very poor performance and initialization time when using RWX with Nextcloud. @jessebot suggested checking the configuration you posted.

I am using Longhorn with NFS-common installed on all nodes.

I created a custom StorageClass and added the same configuration as you showed:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-nfs-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  nfsOptions: "nfsvers=4.2,tcp,,intr,hard,noatime,nodiratime,rsize=1048576,wsize=1048576"

Still I see the same horrible performance.

I noticed that at the beginning the initialization was quick enough, but after the first 200 MB it dropped to probably less than 1 MB/s, and it keeps getting slower and slower...
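
One thing worth checking (pod name is a placeholder) is whether those nfsOptions actually ended up on the mount inside the pod:

# replace $NEXTCLOUD_POD with your actual pod name
kubectl exec $NEXTCLOUD_POD -- cat /proc/mounts | grep /var/www/html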

Do you have any suggestions for debugging this, please?

MohammedNoureldin avatar May 31 '24 13:05 MohammedNoureldin

Can anyone still working on this confirm whether disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.

I can confirm that disabling the probes is a functional workaround. But I would consider tweaking the startupProbe a better solution:

startupProbe:
  enabled: true
  initialDelaySeconds: 120 #30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 50 #30
  successThreshold: 1

This keeps Kubernetes waiting long enough to get even major upgrades done: with these values the startup probe allows roughly 120 s of initial delay plus 50 × 10 s of retries, i.e. about 10 minutes, before the container is restarted. One probably has to play around with initialDelaySeconds or failureThreshold depending on whether the storage performance is better or worse.

pfaelzerchen avatar May 31 '24 16:05 pfaelzerchen

That is just a workaround for the real issue, which is that the whole PV initialization and even the runtime performance of Nextcloud on RWX storage are horrible.

I understand that delaying the probes helps get the software running, but we should try to find a proper solution, maybe by fine-tuning the NFS options; I don't know how exactly, so any suggestion would be great and helpful.

MohammedNoureldin avatar May 31 '24 16:05 MohammedNoureldin

@tvories @jessebot I rechecked and can confirm what I mentioned in the comment above https://github.com/nextcloud/helm/issues/399#issuecomment-2142214767

Initializing Nextcloud on NFS starts at a good speed - the PV gets filled really quickly, I'd say at more than 25 MB/s - and then slowly slows down until about 200 MB of the PV is used, at which point it becomes horribly slow, almost 0.1 MB/s.

What could the cause be?

MohammedNoureldin avatar May 31 '24 17:05 MohammedNoureldin

@MohammedNoureldin I see you have some NFS settings defined in your NFS StorageClass. I'm assuming it has something to do with how you are hosting your NFS share or some configuration there. Do you have NFS v4.2 enabled on your NFS server? Have you tried adjusting some of your NFS settings to see if it makes a difference?

It's going to be hard to troubleshoot without knowing all of the details of your network and storage situation.

You could rule out NFS as the culprit by trying a different storage class and seeing if it works better.
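
One quick way to separate Nextcloud from the storage layer (a sketch; $NEXTCLOUD_POD is a placeholder and the test files are removed afterwards) is to benchmark the RWX mount directly, both sequential throughput and small-file writes, since the init rsync is mostly small files:

# replace $NEXTCLOUD_POD with your actual pod name
# sequential write throughput to /var/www/html
kubectl exec $NEXTCLOUD_POD -- sh -c 'dd if=/dev/zero of=/var/www/html/ddtest bs=1M count=200 conv=fsync; rm -f /var/www/html/ddtest'
# small-file writes, closer to what the init rsync does
kubectl exec $NEXTCLOUD_POD -- sh -c 'date; i=0; while [ "$i" -lt 500 ]; do echo x > /var/www/html/st_$i; i=$((i+1)); done; date; rm -f /var/www/html/st_*'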

tvories avatar Jun 03 '24 14:06 tvories