
"internal error: Missing parent ID on node" when uploading many files

DamianRyse opened this issue 2 months ago · 12 comments

Describe the bug

I'm trying to upload a folder with three subfolders and a total of 282 files to a space. While most of the files upload fine, some throw an error like this:

{
    "level": "error",
    "service": "storage-users",
    "host.name": "opencloud-75b46f489b-jkwxk",
    "pkg": "rgrpc",
    "driver": "posix",
    "error": "internal error: Missing parent ID on node",
    "path": "/var/lib/opencloud/storage/users/projects/c3e284a5-5e4d-4be5-80c6-49b31de55616/filename.ext",
    "time": "2025-10-02T07:43:14Z",
    "message": "failed to read node"
}

Steps to reproduce

  1. Upload a folder with lots of files

Expected behavior

No errors; all files are uploaded correctly.

Actual behavior

See Describe the bug

Setup

OpenCloud is deployed in a k3s cluster. The configuration and data storage are on a mounted NFS storage, the underlying file system is ZFS.

Additional context

Sometimes the files are still uploaded despite the error message. But sometimes the OpenCloud frontend shows an "Unknown Error" message and gives me trace IDs.

DamianRyse avatar Oct 02 '25 07:10 DamianRyse

Here's an example of a frontend error that sometimes occurs (screenshot attached).

DamianRyse avatar Oct 02 '25 08:10 DamianRyse

OpenCloud is deployed in a k3s cluster. The configuration and data storage are on a mounted NFS storage, the underlying file system is ZFS.

That is interesting. How did you mount the NFS on the node?

For OpenCloud in Kubernetes it is essential that the NFS mount uses no caching, which is the noac mount option.
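In your PersistentVolume that would roughly look like this (just a sketch, not a tested manifest; adjust name, size, server and path to your setup):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: opencloud-data-pv
spec:
  storageClassName: ""
  capacity:
    storage: 100Gi                        # placeholder size
  accessModes:
    - ReadWriteMany
  mountOptions:                           # passed to the NFS client on the node
    - noac                                # disable attribute caching
    - vers=4.2
  nfs:
    path: /path/to/opencloud/data         # placeholder path
    server: nfs.example.local             # placeholder server
  persistentVolumeReclaimPolicy: Retain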

Just out of curiosity: how did you deploy OpenCloud in k3s?

micbar avatar Oct 02 '25 09:10 micbar

The NFS shares are mounted using PersistentVolumes and PersistentVolumeClaims in my YAML files. For example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: opencloud-config-pv
spec:
  storageClassName: ""
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /mnt/Main/k3s/opencloud/config
    server: srv-nas-01.rysenet.local
  persistentVolumeReclaimPolicy: Retain

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: opencloud-config-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi
  volumeName: opencloud-config-pv

And besides the PVCs, here's what I use to deploy to k3s:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: opencloud
  labels:
    app: opencloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opencloud
  template:
    metadata:
      labels:
        app: opencloud
    spec:
      nodeSelector:
        region: "public"
      containers:
        - name: opencloud
          image: opencloudeu/opencloud-rolling:latest
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh"]
          args: ["-c", "opencloud init || true; opencloud server"]
          ports:
            - name: http
              containerPort: 9200
            - name: nats
              containerPort: 9233

          env:
            - name: OC_INSECURE
              value: "true"
            - name: OC_DOMAIN
              value: "my.domain.eu"
            - name: OC_URL
              value: "https://my.domain.eu"
            - name: PROXY_HTTP_ADDR
              value: "0.0.0.0:9200"
            - name: INITIAL_ADMIN_PASSWORD
              value: "a-super-secret-initial-password"
            - name: PROXY_ENABLE_BASIC_AUTH
              value: "true"
            - name: PROXY_TLS
              value: "false"
          volumeMounts:
            - name: opencloud-config
              mountPath: /etc/opencloud
            - name: opencloud-data
              mountPath: /var/lib/opencloud
      volumes:
        - name: opencloud-config
          persistentVolumeClaim:
            claimName: opencloud-config-pvc
        - name: opencloud-data
          persistentVolumeClaim:
            claimName: opencloud-data-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: opencloud
spec:
  selector:
    app: opencloud
  ports:
    - port: 9200
      targetPort: 9200
      name: http
    - port: 9233
      targetPort: 9233
      name: nats
  type: ClusterIP


---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: opencloud-ingress
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
  rules:
    - host: my.domain.eu
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: opencloud
                port:
                  number: 9200

TLS certificate termination is handled by my OPNsense firewall and the HAProxy plugin.

DamianRyse avatar Oct 02 '25 10:10 DamianRyse

That does not answer the question about the actual NFS mount on the host.

micbar avatar Oct 02 '25 10:10 micbar

It actually does. The mount itself is handled automatically by the k3s node, so I didn't have to mount the share manually. But here's the output of mount directly on the worker node:

srv-nas-01.rysenet.local:/mnt/Main/k3s/opencloud/data on /var/lib/kubelet/pods/05f69b13-9e26-47fe-a56c-62aa928d4ade/volumes/kubernetes.io~nfs/opencloud-data-pv type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.4,local_lock=none,addr=192.168.178.12)

DamianRyse avatar Oct 02 '25 10:10 DamianRyse

I have modified the PersistentVolumes and added:

  mountOptions:
    - noac
    - vers=4.2

After restarting the service, the mount options now look like this:

(rw,sync,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,noac,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.4,local_lock=none,addr=192.168.178.12)

Which should be fine now. I'll test OpenCloud's behavior and provide feedback.

DamianRyse avatar Oct 02 '25 10:10 DamianRyse

So, I'm getting different errors now. The old error still occurs, but not as often anymore. Instead, I'm getting a lot of "numerical result out of range" errors.

{
    "level": "error",
    "service": "storage-users",
    "host.name": "opencloud-67f7cff9f6-wvbw5",
    "pkg": "rgrpc",
    "driver": "posix",
    "error": "xattr.list /var/lib/opencloud/storage/users/projects/c3e284a5-5e4d-4be5-80c6-49b31de55616/filename.ext",
    "time": "2025-10-02T11:01:02Z",
    "message": "failed to read node"
}

DamianRyse avatar Oct 02 '25 11:10 DamianRyse

@DamianRyse thank you for reporting.

micbar avatar Oct 02 '25 11:10 micbar

Another issue I'm facing now shows up when I try to rename a folder in a space:

{"level":"error","service":"storage-users","host.name":"opencloud-67f7cff9f6-wvbw5","pkg":"rgrpc","traceid":"3ebf412ad45e70384a47ed6ef14c217d","error":"node.XattrsWithReader: no data available","spaceid":"c3e284a5-5e4d-4be5-80c6-49b31de55616","nodeid":"","time":"2025-10-02T12:47:12Z","message":"error reading permissions"}

I'm fairly sure it has something to do with the data dir being an NFS mount, but I have no clue why or how to fix it.

DamianRyse avatar Oct 02 '25 12:10 DamianRyse

@butonic @rhafer any ideas?

micbar avatar Oct 02 '25 13:10 micbar

After experimenting more with this setup, I'd like to share my observations:

Case 1: Data dir is a default NFS share

OpenCloud's speed is somewhat okay, but it does not fully utilize a standard Gigabit Ethernet connection, even though speed tests with iperf3 showed the link can reach full speed. When uploading multiple files via the web interface, a lot of POSIX errors occurred (see my comments above). Sometimes files fail to upload completely, even when retrying them as a single-file upload afterwards.

Case 2: Data dir is a NFS share but with noac option

OpenCloud is basically unusable in this configuration. Disabling the caching completely forces the filesystem to flush every I/O operation before continuing with the next one. This results in stuttering uploads: short bursts of data transfer, then a pause of a few seconds before the next burst. While files are uploading and the filesystem is busy, the web interface's responsiveness is heavily degraded.

Case 3: Data dir is a NFS share but with low caching times

Instead of disabling the cache completely, I've limited the caching timings to very low values:

acregmin=1
acregmax=5
acdirmin=1
acdirmax=5

This helped a little, both performance-wise and stability-wise, but by far not as much as expected. I tried various timings to see whether they affect the file uploads, but the impact was small.
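In the PersistentVolume spec that looks roughly like this (sketch; the rest of the PV is unchanged from my config example above):

  mountOptions:
    - vers=4.2
    - acregmin=1   # minimum attribute cache time for files, in seconds
    - acregmax=5   # maximum attribute cache time for files
    - acdirmin=1   # minimum attribute cache time for directories
    - acdirmax=5   # maximum attribute cache time for directories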

Case 4: Replaced NFS with an iSCSI block device

This is currently my best solution for network-based storage for OpenCloud. Uploading many small files works flawlessly, without any of the POSIX errors mentioned above. Data transfer is still very slow compared to what would be possible: uploading about 200 files of roughly 7 MB each results in an average upload speed of 25 MiB/s. On the other hand, uploading a single large file (about 3 GiB in my test) increased the upload speed to about 60 MiB/s. A manual file copy from my local client to the block storage maxed out at about 106 MiB/s.
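For anyone who wants to try the same, an iSCSI-backed PersistentVolume looks roughly like this (portal, IQN, size and filesystem below are placeholders, not my actual values; note that an iSCSI block volume is ReadWriteOnce, which is fine with replicas: 1):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: opencloud-data-pv
spec:
  storageClassName: ""
  capacity:
    storage: 500Gi                              # placeholder size
  accessModes:
    - ReadWriteOnce                             # block device, mounted by a single node
  iscsi:
    targetPortal: 192.168.178.12:3260           # placeholder portal
    iqn: iqn.2025-10.local.example:opencloud    # placeholder IQN
    lun: 0
    fsType: ext4                                # placeholder filesystem
  persistentVolumeReclaimPolicy: Retain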

Case 5: Standard containerized installation on a single VM with only local storage

No issues at all. OpenCloud worked perfectly well with very good performance and no errors.

Conclusion

It's clear to me that OpenCloud has a weakness when its storage is on a network share. I haven't tried SMB, but I'd assume it would be similarly slow to NFS. I've seen a lot of xattr errors when using NFS, which I thought might be caused by a misconfiguration of my NFS share, but I haven't found anything wrong. So, for people who want to run OpenCloud on Kubernetes with a file storage server, I'd recommend going for an iSCSI solution. Even if file transfers are still slow, they at least work without I/O errors.

DamianRyse avatar Oct 02 '25 23:10 DamianRyse

@DamianRyse thank you for the thorough comparison and your efforts.

The OpenCloud team runs OpenCloud on large network filesystems like CephFS and IBM Spectrum Scale (GPFS).

I think AWS Elastic File System (EFS) and AWS FSx for Lustre can also be good candidates.

micbar avatar Oct 04 '25 21:10 micbar