Inconsistent inode numbers for mounted files
Description
I'm trying to run fluent-bit inside of gVisor. It uses a hostPath mount to read all the container logs and forward them somewhere else. To keep track of what has already been processed, it writes a sqlite DB file to disk (also a hostPath in my case) and identifies the underlying files by path and inode.
After noticing some log duplication after rolling my pods, I dug in and it seems that files mounted via the hostPath mount do not have consistent inode numbers: they vary from restart to restart of the container. That completely breaks any kind of tracking fluent-bit could be doing in this case.
The inode numbers are very low, suggesting to me that gVisor is assigning these numbers internally. The "real" host inodes are much higher.
I've tried switching directfs and overlay2 on and off but haven't noticed any change in behavior.
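As a standalone illustration (a minimal sketch; the /var/log/containers path is just an assumption, pass another directory as an argument), a small Go program like the following can be run both inside the sandbox and directly on the host against the same hostPath to compare what each reports:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Directory to inspect; /var/log/containers is assumed here.
	dir := "/var/log/containers"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		// On Linux, Sys() is a *syscall.Stat_t; Dev and Ino together identify the file.
		st := info.Sys().(*syscall.Stat_t)
		fmt.Printf("dev=%d ino=%d %s\n", st.Dev, st.Ino, e.Name())
	}
}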
Steps to reproduce
Run the following Deployment and compare the logs it produces. Ideally, run a couple of pods on the same machine to trigger the effect.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app-component: web
  template:
    metadata:
      labels:
        app-component: web
    spec:
      volumes:
        - hostPath:
            path: /var/log
            type: ""
          name: varlogs
      containers:
        - name: test
          image: busybox
          command: ["sh", "-c", "ls -li /var/log/containers"]
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /var/log/
              name: varlogs
              readOnly: true
      restartPolicy: Always
runsc version
runsc version release-20240212.0-28-g1303df5f706e
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux pool-apps-appworkload-shared-s-4vcpu-8gb-o6j7h 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
Yes, gVisor makes up inode numbers for host files. If I recall correctly, this is necessary to support checkpoint/restore, such that inode numbers remain consistent after a container is restored on a different machine.
I'm not sure there's a good solution that maintains this property while also addressing this use-case. Perhaps via a runsc flag or a volume mount option?
Hmm. Checkpoint/Restore seems rather finicky in the presence of hostPath mounts anyway, I think? There's no guarantee that the host's filesystem looks identical on another machine I guess 🤔. Isn't checkpoint/restore off the table in such a scenario anyway?
That said, I'd personally be fine with a flag that essentially disables checkpoint/restore in favor of use-cases like described.
There has been work done on this front; I just want to provide some context.
Two things are used to identify a file on a filesystem: 1) the inode number and 2) the device ID. Different files on different devices can have the same inode number. The gVisor sandbox virtualizes device IDs, so we can't expose host device IDs as-is.
As of now, the gofer filesystem gives the gofer mount a virtual (sandbox-internal) device ID and generates new inode numbers incrementally for each file. The inode number generation is keyed by the combination of host inode number and host device ID. This is actually quite expensive: we need to maintain a huge, ever-growing map (with an entry for every unique file ever encountered) from [host inode, host device ID] -> gofer inode number. If a host file is not found in this map, we increment a counter and use that value as its inode number.
Note that we cannot pass through the host inode number as-is because there might be conflicts (the host filesystem being served may itself have multiple mountpoints with different devices and conflicting inode numbers). Because of this, the syscall implementation for getdents64(2) (which returns inode numbers for each directory entry) requires us to stat(2) each directory entry so we can fetch the host device ID and build the inode number mapping mentioned above. This makes getdents64(2) very slow.
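For readers who don't know fsimpl/gofer, a rough sketch of the scheme described above (illustrative only, not the actual gVisor code; names and the example device/inode values are made up) looks like this:

package main

import (
	"fmt"
	"sync"
)

// hostKey identifies a host file: different devices can reuse inode numbers,
// so both the host device ID and the host inode number are needed.
type hostKey struct {
	dev uint64
	ino uint64
}

// inodeTable hands out sandbox-internal inode numbers. It grows by one entry
// for every unique host file ever seen and is never pruned.
type inodeTable struct {
	mu   sync.Mutex
	next uint64
	m    map[hostKey]uint64
}

func newInodeTable() *inodeTable {
	return &inodeTable{m: make(map[hostKey]uint64)}
}

// virtualInode returns the inode number exposed to the sandboxed application.
// Because the counter restarts from zero in every new sandbox, the same host
// file gets a different (and typically very low) number after each container
// restart, which is what breaks inode-based tracking like fluent-bit's.
func (t *inodeTable) virtualInode(hostDev, hostIno uint64) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := hostKey{hostDev, hostIno}
	if v, ok := t.m[k]; ok {
		return v
	}
	t.next++
	t.m[k] = t.next
	return t.next
}

func main() {
	t := newInodeTable()
	fmt.Println(t.virtualInode(64769, 556594)) // first file seen  -> 1
	fmt.Println(t.virtualInode(64769, 556595)) // second file seen -> 2
	fmt.Println(t.virtualInode(64769, 556594)) // same file again  -> 1
}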
For these performance reasons, @nixprime had made this proposal: https://github.com/google/gvisor/issues/6665#issuecomment-939121851. As per this, we can pass through the host inode number but map the host device ID to a sentry-internal device ID. I had implemented this proposal in #7801, but dropped it for S/R reasons: https://github.com/google/gvisor/issues/6665#issuecomment-1716937822.
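For concreteness, a rough sketch of what that passthrough proposal does (again illustrative only, not the actual change in #7801): only the host device ID is remapped, the host inode number passes through unchanged, so the per-file map disappears and inode numbers stay stable across sandbox restarts.

package main

import (
	"fmt"
	"sync"
)

// deviceTable remaps host device IDs to sandbox-internal ones. It has one
// entry per host device (a handful), instead of one entry per file.
type deviceTable struct {
	mu   sync.Mutex
	next uint32
	m    map[uint64]uint32 // host device ID -> sentry-internal device ID
}

func newDeviceTable() *deviceTable {
	return &deviceTable{m: make(map[uint64]uint32)}
}

// virtualID returns the (device ID, inode number) pair exposed to the
// sandboxed application: the device ID is virtualized, the inode number is
// the host's, so it stays identical across sandbox restarts as long as the
// underlying host file doesn't change.
func (t *deviceTable) virtualID(hostDev, hostIno uint64) (uint32, uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	dev, ok := t.m[hostDev]
	if !ok {
		t.next++
		dev = t.next
		t.m[hostDev] = dev
	}
	return dev, hostIno
}

func main() {
	t := newDeviceTable()
	dev, ino := t.virtualID(64769, 556594) // hypothetical host device ID and inode
	fmt.Printf("dev=%d ino=%d\n", dev, ino) // inode 556594 passes through unchanged
}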
My question to you is, will the approach taken in #7801 work for you? It will give you the same inode numbers across usages. But the device ID of the file may be different in pods.
For my somewhat specific use case of fluent-bit, I do believe that just having stable inode numbers across pods would be sufficient.
For posterity, their sqlite DB looks like this
SELECT * FROM in_tail_files;
id  name                                                                                                                                                       offset  inode   created     rotated
1   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  31046   35      1708613063  0
2   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  4544    36      1708613091  0
3   /var/log/containers/web-b7b476bc9-958kk_app-fd3ed79a-330e-4eb3-92ef-b4776a2410cd_web-f8327bee5633aa6642da877f3d5e8ec44afbacdb14119620880b1bb567b74bd5.log  31441   556594  1708615017  0
(The first two rows are from different instances of the "same" fluent-bit pod on the same node. The last one is from after I switched to runc, and thus shows the machine's real inode.)
They don't seem to take the device ID into account (Vector does: https://vector.dev/docs/reference/configuration/sources/file/#fingerprint.strategy, though it also provides an alternative strategy that would sidestep this issue altogether).
So yes, I believe just having stable inodes would be fine.
@ayushr2 do you have a feeling for whether your proposal could be massaged into an acceptable state w.r.t. S/R, or whether it could be put behind a flag for that reason? Let me know if I can be of any help or if you'd want me to take a whack at trying to implement it. I'm trying to get a sense of the alternatives I have for moving forward on our side.
Hey @markusthoemmes, I am having this conversation internally. I think we are committed to checking in the inode passthrough approach (https://github.com/google/gvisor/pull/7801). It is a performance and compatibility win. It is just a matter of whether we want to do that unconditionally or preserve the current behavior behind a flag. I will update here once we have a conclusion.
Reasons for not preserving the current S/R behavior:
- Increased complexity of maintaining both inode numbering approaches in fsimpl/gofer.
- Not sure if we have active users of the current behavior.
- Inode passthrough approach should give inode number stability across checkpoint/restore when container filesystem is not migrated.
- Inode stability with filesystem migration should be considered out of scope for runsc. It is considered out of scope for CRIU. Higher level tooling needs to deal with this.
Thanks for the update @ayushr2, hugely appreciated! 🥳
Some applications rely on both device ID and inode number stability, so the "inode passthrough" alone would not suffice for them: device IDs are still virtualized, and on restore we would have to reassign sentry-internal device IDs (which may change even though the underlying host device/inode numbers didn't).
I guess it is best to gate the current behavior behind a flag and implement the "inode passthrough" approach as the default.
I do not have cycles immediately to pick this up. @markusthoemmes if this is urgent for you, feel free to rebase https://github.com/google/gvisor/pull/7801 and implement it with a flag. Happy to code review. Otherwise, I will try to pick this up soon-ish.
@ayushr2 not ultra urgent, but it'd be interesting to know a rough timeline just for expectation management. I can try to take a whack at it, but superficially it looks like there are a few dragons there if both paths have to be kept intact 😅