GlusterFS doesn't support O_PATH flag in open()
Description of problem:
When O_PATH is passed to an open() system call, GlusterFS doesn't behave correctly in all cases.
In a FUSE mount, doing the following sequence of operations fails:
fd = open("file", O_PATH);
unlink("file");
fstat(fd, &st);
Checking the logs, it seems that the kernel doesn't send the open() request to Gluster, which explains the error, because Gluster depends on an actual open to keep a file around after the last unlink.
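For anyone who wants to reproduce this, the three-line sequence above can be wrapped in a small helper. This is only a sketch: the function name `fstat_after_unlink` is made up for illustration, and it assumes Linux, where O_PATH needs `_GNU_SOURCE`. On a local filesystem the helper is expected to return 0; on an affected FUSE mount the fstat() step is the one reported to fail.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* O_PATH is a Linux extension exposed by <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens `path` with O_PATH, unlinks it, then calls
 * fstat() on the descriptor. Returns 0 if fstat() still works after the
 * unlink, otherwise the errno of the step that failed.
 * The file at `path` must already exist.
 */
int fstat_after_unlink(const char *path)
{
    struct stat st;
    int err = 0;

    int fd = open(path, O_PATH);
    if (fd < 0)
        return errno;

    /* Save errno before close() can overwrite it. */
    if (unlink(path) != 0 || fstat(fd, &st) != 0)
        err = errno;

    close(fd);
    return err;
}
```

Calling it once with a path on a local filesystem and once with a path inside the Gluster mount makes the difference visible.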
I also tried gfapi and it doesn't work either. In this case the previous code works, but bricks don't see the O_PATH flag, so reads are allowed when they shouldn't be. I found that the client xlator filters the flags and removes O_PATH.
Without this, Gluster may perform worse with the latest versions of Samba, which will use O_PATH in some places to improve performance.
Sounds like a FUSE bug?
Hi @slowfranklin sorry for the late answer.
I think it's not a bug in FUSE. The kernel itself doesn't send open() requests to any filesystem when O_PATH is used. The kernel assumes that an inode with active references won't be destroyed by the filesystem, so once the inode has been looked up, no other requests are needed to keep the inode available for future stats (basically the only operation that can be done on an O_PATH opened file).
The problem is that Gluster keeps files around once they have been deleted only while there are open fd's. To solve this issue we shouldn't completely delete a file until the last reference to the inode has been released, independently of the file descriptors.
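To make the difference between the two policies concrete, here is a toy model. It is not Gluster code: the struct, its fields, and both function names are invented for illustration, and it only contrasts "keep the file while fds are open" with "keep the file while any inode reference exists".

```c
#include <stdbool.h>

/* Toy inode state; field names are illustrative, not Gluster's. */
struct toy_inode {
    unsigned open_fds; /* descriptors created by real open() requests */
    unsigned lookups;  /* references handed out by lookups (kernel nlookup) */
    bool unlinked;     /* the last name of the file has been removed */
};

/* Behavior described above: only open fds keep an unlinked file alive. */
bool can_destroy_fd_only(const struct toy_inode *in)
{
    return in->unlinked && in->open_fds == 0;
}

/* Proposed behavior: any outstanding inode reference keeps it alive. */
bool can_destroy_ref_aware(const struct toy_inode *in)
{
    return in->unlinked && in->open_fds == 0 && in->lookups == 0;
}
```

An O_PATH open corresponds to `open_fds == 0` with `lookups > 0`: the first policy destroys the file on unlink, the second keeps it until the reference is dropped.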
@xhernandez I wanted to understand this issue and the challenges of implementing fop_at() calls through gfapi. Let me know when you have time; we can sync up and then update this issue with the meeting minutes.
I think the server xlator also filters out the O_PATH flag, in the function gf_flags_to_flags.
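For illustration, this is how a whitelist-style flag translation ends up dropping O_PATH without anyone noticing. It is a sketch, not the real gf_flags_to_flags: the `WIRE_*` values, the `host_to_wire` name, and the `keep_path` switch are all assumptions made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_PATH in <fcntl.h> on Linux */
#endif
#include <fcntl.h>

/* Hypothetical wire-format flag bits; these are NOT Gluster's real values. */
enum {
    WIRE_WRONLY = 0x01,
    WIRE_RDWR   = 0x02,
    WIRE_CREAT  = 0x04,
    WIRE_EXCL   = 0x08,
    WIRE_TRUNC  = 0x10,
    WIRE_APPEND = 0x20,
    WIRE_PATH   = 0x40,
};

/*
 * Translates host open() flags to the hypothetical wire format. Only flags
 * listed here survive; with keep_path == 0, O_PATH is silently dropped,
 * which is the kind of filtering discussed in this issue.
 */
unsigned host_to_wire(int flags, int keep_path)
{
    unsigned wire = 0;

    switch (flags & O_ACCMODE) {
    case O_WRONLY: wire |= WIRE_WRONLY; break;
    case O_RDWR:   wire |= WIRE_RDWR;   break;
    default:       break; /* O_RDONLY maps to 0 */
    }
    if (flags & O_CREAT)  wire |= WIRE_CREAT;
    if (flags & O_EXCL)   wire |= WIRE_EXCL;
    if (flags & O_TRUNC)  wire |= WIRE_TRUNC;
    if (flags & O_APPEND) wire |= WIRE_APPEND;
    if (keep_path && (flags & O_PATH))
        wire |= WIRE_PATH;

    return wire;
}
```

With `keep_path == 0`, an O_PATH open is indistinguishable from a plain read-only open on the other side, which matches the observation that reads are allowed when they shouldn't be.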
Yes, we also need to do some changes, but they are very simple. The main issue is the lack of an actual open call from kernel for entries opened with O_PATH.
The main issue is the lack of an actual open call from kernel for entries opened with O_PATH.
I presume all of those things need to be done in any case if we want to support O_PATH with gfapi?
In the case of gfapi it's not defined what's expected. If we assume that gfapi clients will behave like the kernel, then the only thing we'll see is a lookup. However, I think we have more margin here to require that O_PATH opens must be sent. This would reduce the problem (or it could even work after some fixes), but this would create two issues IMO:
- Inconsistency between FUSE and gfapi behaviors. This will lead to other problems sooner or later.
- It doesn't take advantage of the main reason why O_PATH is used: it's a performance improvement because it avoids processing the full open request. If we still process O_PATH opens down to the bricks and the posix layer of Gluster, using it brings no benefit, and we could simply ignore the flag and just add some checks for reads and writes.
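The "additional checks for reads and writes" mentioned in the second point could look roughly like this. It is a sketch with invented names (`toy_fd`, `toy_op`, `toy_check_op`), not an existing Gluster interface; it mirrors what Linux does for O_PATH descriptors, where read() and write() fail with EBADF.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_PATH in <fcntl.h> on Linux */
#endif
#include <errno.h>
#include <fcntl.h>

/* Hypothetical per-descriptor state: just the flags given at open time. */
struct toy_fd {
    int flags;
};

enum toy_op { TOY_READ, TOY_WRITE, TOY_FSTAT };

/* Returns 0 if `op` is allowed on `fd`, or the errno value to report. */
int toy_check_op(const struct toy_fd *fd, enum toy_op op)
{
    int acc;

    /* An O_PATH descriptor only identifies the file; no data access. */
    if (fd->flags & O_PATH)
        return op == TOY_FSTAT ? 0 : EBADF;

    acc = fd->flags & O_ACCMODE;
    if (op == TOY_READ && acc == O_WRONLY)
        return EBADF;
    if (op == TOY_WRITE && acc == O_RDONLY)
        return EBADF;
    return 0;
}
```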
Summary of my discussions with @xhernandez and path ahead:
Why was it not done before?
Mostly because when we first implemented open(), we took the flags handled in fuse and used only those, and O_PATH is not handled in the fuse layer either. That is mainly because the kernel itself maps O_PATH to a lookup() with an ‘nlookup’ increase.
Why do we need O_PATH?
For consistency (and, as a result, better caching), many applications running on a filesystem use a file descriptor (fd) opened with O_PATH in openat(), mkdirat(), etc. (ie, all ‘${fop}at()’ calls). This protects against the path being altered while someone is operating on lower nodes of the path tree. For example, applications/services like smbd use these ‘at()’ calls in their vfs fsal layers.
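As a reminder of what this pattern looks like from the application side, here is a small sketch using the plain POSIX/Linux calls (not gfapi). The function name `populate_via_dirfd` and the entry names `sub` and `marker` are made up for the example; the point is that `base` is resolved once and every later operation is relative to the O_PATH descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_PATH in <fcntl.h> on Linux */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates `sub/` and an empty `sub/marker` under
 * `base`. `base` is resolved only once; later renames of the path leading
 * to it do not affect the calls made through the descriptor.
 * Returns 0 on success or the errno of the failing step.
 */
int populate_via_dirfd(const char *base)
{
    int err = 0;
    int dirfd = open(base, O_PATH | O_DIRECTORY);

    if (dirfd < 0)
        return errno;

    if (mkdirat(dirfd, "sub", 0755) != 0) {
        err = errno;
    } else {
        int fd = openat(dirfd, "sub/marker", O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            err = errno;
        else
            close(fd);
    }

    close(dirfd);
    return err;
}
```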
While glusterfs’s open() fop originating from fuse layer may not have O_PATH, it may be present in open originating from glfs_open().
How to get this implemented?
Getting a proper implementation of O_PATH in glusterfs that is consistent across both fuse and libgfapi is an effort that touches how we manage inode references today. More details on this are given in the later part of this section.
NOTE: Whichever way we support O_PATH in glusterfs, it will require changes in the protocol layer (ie, in XDR, and maybe in how xdata’s fields are interpreted). Thus, the O_PATH feature will be supported only when both client and server are at certain versions.
Changes for O_PATH
It would be good to implement it in phases IMO.
Part I - Get O_PATH passed to the server/brick process in open() call.
For a moment, ignore the fact that we don’t receive O_PATH in fuse, and treat glusterfs’s protocol as if it receives O_PATH. Today, O_PATH is not handled in glusterfs’s protocol layer, and just handling it there should make glusterfs’s open() more posix compliant.
This can be demonstrated with a sequence of glfs_open() / glfs_unlink() / glfs_fstat() calls.
The PR for this can include that test case to get started.
Part II - Pass the client’s ‘nlookup’ from its inode table to the server side, and pass the same in forget too.
This is a PR in itself and would need more testing for reference leaks. But the idea here is simple: if any client mount has a ‘reference’ on the inode, the server brick should also hold a reference on the same file. This alone will bring consistency.
Part III - Handle server inode pruning properly to handle nlookup
Server’s inode pruning should send an invalidation to the client, and the server should forget the inode only if the client gets a forget(); otherwise, it should keep the reference intact.
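To illustrate the bookkeeping that Parts II and III imply, here is a toy model of a brick mirroring each client's lookup count. Everything in it is hypothetical (the struct, the function names, the fixed client limit); it only shows the rule that the server may prune an inode once every client has forgotten it, with forget carrying the number of lookups being dropped, as the FUSE forget request does.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_CLIENTS 4 /* arbitrary limit for the sketch */

/* Server-side view of one inode: lookup references held per client. */
struct toy_server_inode {
    uint64_t nlookup[TOY_MAX_CLIENTS];
};

/* A client looked the inode up: mirror its reference on the server. */
void toy_lookup(struct toy_server_inode *in, unsigned client)
{
    if (client < TOY_MAX_CLIENTS)
        in->nlookup[client]++;
}

/* A client forgot `n` lookups; clamp at zero instead of underflowing. */
void toy_forget(struct toy_server_inode *in, unsigned client, uint64_t n)
{
    if (client >= TOY_MAX_CLIENTS)
        return;
    in->nlookup[client] = n > in->nlookup[client] ? 0 : in->nlookup[client] - n;
}

/* Pruning is allowed only once no client holds a reference. */
bool toy_can_prune(const struct toy_server_inode *in)
{
    for (unsigned i = 0; i < TOY_MAX_CLIENTS; i++)
        if (in->nlookup[i] != 0)
            return false;
    return true;
}
```

In this model, an invalidation sent by the server would be the step that eventually causes a client to call the equivalent of `toy_forget`; until that happens, `toy_can_prune` stays false.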
Implementing ‘at()’ calls in glfs.
This part can start as soon as ‘Part I’ above is completed, so a proper test case can also be added for it.
This will help in vfs_glusterfs.c of smb/source3, and smb/source4 to work smoothly with glusterfs.
Updated this here so work on this can be started. More updates will be given along with the PR, and/or here.
Would be nice to get all those new gfapi calls implemented mentioned in the release 11 tracker.
Is that done and complete, for Gluster 11?
O_PATH support for gfapi should work, though it's not optimized, but FUSE mounts won't support it until #3812 is addressed.
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.