Inconsistent inode numbers for mounted files
Description
I'm trying to run fluent-bit inside of gVisor. It uses a hostPath mount to read all the container logs and forward them somewhere else. To keep track of what has already been processed, it writes a sqlite DB file to disk (also a hostPath in my case) and identifies the underlying files by path and inode.
After noticing some log duplication after rolling my pods, I dug in and it seems that files mounted via the hostPath mount do not have consistent inode numbers: they vary from restart to restart of the container. That completely breaks any kind of tracking fluent-bit could be doing in this case.
The inode numbers are very low, suggesting to me that gVisor is assigning these numbers internally. The "real" host inodes are much higher.
I've tried switching directfs and overlay2 on and off but haven't noticed any change in behavior.
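As a standalone illustration (a minimal sketch; the /var/log/containers path is just an assumption, pass another directory as an argument), a small Go program like the following can be run both inside the sandbox and directly on the host against the same hostPath to compare what each reports:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Directory to inspect; /var/log/containers is assumed here.
	dir := "/var/log/containers"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		// On Linux, Sys() is a *syscall.Stat_t; Dev and Ino together identify the file.
		st := info.Sys().(*syscall.Stat_t)
		fmt.Printf("dev=%d ino=%d %s\n", st.Dev, st.Ino, e.Name())
	}
}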
Steps to reproduce
Run the following Deployment and compare the logs it produces. Ideally, run a couple of pods on the same machine to trigger the effect.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app-component: web
  template:
    metadata:
      labels:
        app-component: web
    spec:
      volumes:
        - hostPath:
            path: /var/log
            type: ""
          name: varlogs
      containers:
        - name: test
          image: busybox
          command: ["sh", "-c", "ls -li /var/log/containers"]
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /var/log/
              name: varlogs
              readOnly: true
      restartPolicy: Always
runsc version
runsc version release-20240212.0-28-g1303df5f706e
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux pool-apps-appworkload-shared-s-4vcpu-8gb-o6j7h 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
Yes, gVisor makes up inode numbers for host files. If I recall correctly, this is necessary to support checkpoint/restore, such that inode numbers remain consistent after a container is restored on a different machine.
I'm not sure there's a good solution that maintains this property while also addressing this use-case. Perhaps via a runsc flag or a volume mount option?
Hmm. Checkpoint/Restore seems rather finicky in the presence of hostPath mounts anyway, I think? There's no guarantee that the host's filesystem looks identical on another machine I guess 🤔. Isn't checkpoint/restore off the table in such a scenario anyway?
That said, I'd personally be fine with a flag that essentially disables checkpoint/restore in favor of use-cases like described.
There has been work done on this front; I just want to provide some context.
Two things are used to identify a file on a filesystem: 1) the inode number and 2) the device ID. Different files on different devices can have the same inode number. The gVisor sandbox virtualizes device IDs, so we can't expose host device IDs as-is.
As of now, the gofer filesystem gives the gofer mount a virtual (sandbox-internal) device ID and generates new inode numbers incrementally for each file. The inode number generation is keyed by the combination of host inode number and host device ID. This is actually quite expensive: we need to maintain a huge, ever-growing map (with an entry for every unique file ever encountered) from [host inode, host device ID] -> gofer inode number. If a host file is not found in this map, we increment a counter and use that value as its inode number.
Note that we cannot pass through the host inode number as-is because there might be conflicts (the host filesystem being served may itself have multiple mountpoints with different devices and conflicting inode numbers). Because of this, the syscall implementation for getdents64(2) (which returns inode numbers for each directory entry) requires us to stat(2) each directory entry so we can fetch the host device ID and build the inode number mapping mentioned above. This makes getdents64(2) very slow.
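For readers who don't know fsimpl/gofer, a rough sketch of the scheme described above (illustrative only, not the actual gVisor code; names and the example device/inode values are made up) looks like this:

package main

import (
	"fmt"
	"sync"
)

// hostKey identifies a host file: different devices can reuse inode numbers,
// so both the host device ID and the host inode number are needed.
type hostKey struct {
	dev uint64
	ino uint64
}

// inodeTable hands out sandbox-internal inode numbers. It grows by one entry
// for every unique host file ever seen and is never pruned.
type inodeTable struct {
	mu   sync.Mutex
	next uint64
	m    map[hostKey]uint64
}

func newInodeTable() *inodeTable {
	return &inodeTable{m: make(map[hostKey]uint64)}
}

// virtualInode returns the inode number exposed to the sandboxed application.
// Because the counter restarts from zero in every new sandbox, the same host
// file gets a different (and typically very low) number after each container
// restart, which is what breaks inode-based tracking like fluent-bit's.
func (t *inodeTable) virtualInode(hostDev, hostIno uint64) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := hostKey{hostDev, hostIno}
	if v, ok := t.m[k]; ok {
		return v
	}
	t.next++
	t.m[k] = t.next
	return t.next
}

func main() {
	t := newInodeTable()
	fmt.Println(t.virtualInode(64769, 556594)) // first file seen  -> 1
	fmt.Println(t.virtualInode(64769, 556595)) // second file seen -> 2
	fmt.Println(t.virtualInode(64769, 556594)) // same file again  -> 1
}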
For these performance reasons, @nixprime had made this proposal: https://github.com/google/gvisor/issues/6665#issuecomment-939121851. As per this, we can pass through the host inode number but map the host device ID to a sentry-internal device ID. I had implemented this proposal in #7801, but dropped it for S/R reasons: https://github.com/google/gvisor/issues/6665#issuecomment-1716937822.
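For concreteness, a rough sketch of what that passthrough proposal does (again illustrative only, not the actual change in #7801): only the host device ID is remapped, the host inode number passes through unchanged, so the per-file map disappears and inode numbers stay stable across sandbox restarts.

package main

import (
	"fmt"
	"sync"
)

// deviceTable remaps host device IDs to sandbox-internal ones. It has one
// entry per host device (a handful), instead of one entry per file.
type deviceTable struct {
	mu   sync.Mutex
	next uint32
	m    map[uint64]uint32 // host device ID -> sentry-internal device ID
}

func newDeviceTable() *deviceTable {
	return &deviceTable{m: make(map[uint64]uint32)}
}

// virtualID returns the (device ID, inode number) pair exposed to the
// sandboxed application: the device ID is virtualized, the inode number is
// the host's, so it stays identical across sandbox restarts as long as the
// underlying host file doesn't change.
func (t *deviceTable) virtualID(hostDev, hostIno uint64) (uint32, uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	dev, ok := t.m[hostDev]
	if !ok {
		t.next++
		dev = t.next
		t.m[hostDev] = dev
	}
	return dev, hostIno
}

func main() {
	t := newDeviceTable()
	dev, ino := t.virtualID(64769, 556594) // hypothetical host device ID and inode
	fmt.Printf("dev=%d ino=%d\n", dev, ino) // inode 556594 passes through unchanged
}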
My question to you is, will the approach taken in #7801 work for you? It will give you the same inode numbers across usages. But the device ID of the file may be different in pods.
For my somewhat specific use case of fluent-bit, I do believe that just having stable inode numbers across pods would be sufficient.
For posterity, their sqlite DB looks like this
SELECT * FROM in_tail_files;
id  name                                                                                                                                                       offset  inode   created     rotated
1   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  31046   35      1708613063  0
2   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  4544    36      1708613091  0
3   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  31441   556594  1708615017  0
(The first two rows are from different instances of the "same" fluent-bit pod on the same node. The last one is from after I switched to runc, and thus shows the machine's real inode.)
They don't seem to take the device ID into account (Vector does: https://vector.dev/docs/reference/configuration/sources/file/#fingerprint.strategy, though it also provides an alternative strategy that would sidestep this issue altogether).
So yes, I believe just having stable inodes would be fine.
@ayushr2 do you have a feeling for whether your proposal could be massaged into an acceptable state w.r.t. S/R, or whether it could be put behind a flag for that reason? Let me know if I can be of any help or if you'd want me to take a whack at trying to implement it. I'm trying to get a sense of the alternatives I have for moving forward on our side.
Hey @markusthoemmes, I am having this conversation internally. I think we are committed to checking in the inode passthrough approach (https://github.com/google/gvisor/pull/7801). It is a performance and compatibility win. It is just a matter of whether we want to do that unconditionally or preserve the current behavior behind a flag. I will update here once we have a conclusion.
Reasons for not preserving the current S/R behavior:
- Increased complexity of maintaining both inode numbering approaches in fsimpl/gofer.
- Not sure if we have active users of the current behavior.
- Inode passthrough approach should give inode number stability across checkpoint/restore when container filesystem is not migrated.
- Inode stability with filesystem migration should be considered out of scope for runsc. It is considered out of scope for CRIU. Higher level tooling needs to deal with this.
Thanks for the update @ayushr2, hugely appreciated! 🥳
Some applications rely on both device ID and inode number stability, so the "inode passthrough" alone would not suffice for them: device IDs are still virtualized, and on restore we would have to reassign sentry-internal device IDs (which may change even though the underlying host device/inode numbers didn't).
I guess it is best to gate the current behavior behind a flag and implement the "inode passthrough" approach as the default.
I do not have cycles immediately to pick this up. @markusthoemmes if this is urgent for you, feel free to rebase https://github.com/google/gvisor/pull/7801 and implement it with a flag. Happy to code review. Otherwise, I will try to pick this up soon-ish.
@ayushr2 not ultra urgent, but it'd be interesting to know a rough timeline just for expectation management. I can try to take a whack at it, but superficially it looks like there are a few dragons there if both paths have to be kept intact 😅