GlusterFS doesn't support O_PATH flag in open()

Open xhernandez opened this issue 4 years ago • 14 comments

Description of problem:

When O_PATH is passed to open(), GlusterFS does not behave correctly in all cases.

On a FUSE mount, the following sequence of operations fails:

    fd = open("file", O_PATH);
    unlink("file");
    fstat(fd, &st);

Checking the logs, it seems that the kernel doesn't send the open() request to Gluster, which explains the error, because Gluster depends on an actual open to keep a file around after the last unlink.

I also tried gfapi and it doesn't work properly either. In this case the previous code works, but the bricks don't see the O_PATH flag, so reads are allowed when they shouldn't be. I found that the client xlator filters the flags and removes O_PATH.

Without O_PATH support, Gluster may perform worse with recent versions of Samba, which will use O_PATH in some places to improve performance.
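For anyone who wants to try the sequence above as a complete program, here is a minimal sketch. The function name `opath_fstat_after_unlink` is made up for illustration, and the fallback value for `O_PATH` is an assumption that only holds for the generic Linux ABI. On a local Linux filesystem the function is expected to return 0; per the report above, the fstat() step is the one that fails on a Gluster FUSE mount.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes O_PATH only when this is defined */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value, used only if the headers hide it */
#endif

/*
 * Hypothetical helper: create `path`, open it with O_PATH, unlink it and
 * then fstat() the descriptor.
 * Returns 0 if every step succeeds, otherwise the errno of the failing step.
 */
int opath_fstat_after_unlink(const char *path)
{
    struct stat st;
    int err = 0;
    int fd;

    assert(path != NULL);

    /* Make sure the file exists before taking the O_PATH reference. */
    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return errno;
    close(fd);

    /* O_PATH: the kernel resolves the name but sends no open to the filesystem. */
    fd = open(path, O_PATH);
    if (fd < 0) {
        err = errno;
        unlink(path);
        return err;
    }

    /* After the last name is gone, only the fd keeps the inode referenced. */
    if (unlink(path) < 0 || fstat(fd, &st) < 0)
        err = errno;

    close(fd);
    return err;
}
```

Calling `opath_fstat_after_unlink("/mnt/gluster/file")` (the mount point is only an example) and printing `strerror()` of a non-zero result shows which error the mount reports.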

xhernandez avatar Aug 17 '21 10:08 xhernandez

Sounds like a FUSE bug?

slowfranklin avatar Sep 21 '21 17:09 slowfranklin

Hi @slowfranklin sorry for the late answer.

I don't think it's a bug in FUSE. The kernel itself doesn't send open() requests to any filesystem when O_PATH is used. The kernel assumes that an inode with active references won't be destroyed by the filesystem, so once the inode has been looked up, no other requests are needed to keep the inode available for future stats (basically the only operation that can be done on a file opened with O_PATH).

The problem is that Gluster keeps deleted files around only while there are open fd's. To solve this issue, we shouldn't completely delete a file until the last reference to the inode has been released, independently of the file descriptors.
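The difference between the two lifetime rules can be sketched with a toy model. Everything here (`struct toy_inode`, both function names) is hypothetical and much simpler than Gluster's real inode handling; it only illustrates why an O_PATH open, which adds a lookup reference but no fd, is not enough to protect an unlinked file under the fd-only rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bookkeeping; not Gluster's actual inode_t. */
struct toy_inode {
    unsigned long nlookup; /* references handed out by lookups */
    unsigned int open_fds; /* descriptors created by real open() requests */
    bool unlinked;         /* the last name has been removed */
};

/* Rule described above as today's behavior: only open fds protect the file. */
bool toy_can_destroy_fd_only(const struct toy_inode *in)
{
    assert(in != NULL);
    return in->unlinked && in->open_fds == 0;
}

/* Proposed rule: any outstanding inode reference also protects the file. */
bool toy_can_destroy_ref_aware(const struct toy_inode *in)
{
    assert(in != NULL);
    return in->unlinked && in->open_fds == 0 && in->nlookup == 0;
}
```

With `nlookup = 1`, `open_fds = 0` and `unlinked = true` (the state after an O_PATH open followed by unlink), the first rule allows destruction while the second does not.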

xhernandez avatar Mar 07 '22 11:03 xhernandez

@xhernandez I wanted to understand this issue and the challenges of implementing fop_at() calls through gfapi. Let me know when you have time; we can sync up and then update this issue with the meeting minutes.

amarts avatar Mar 20 '22 18:03 amarts

I think server_xlator filters the flag(O_PATH) by the function (gf_flags_to_flags).
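To illustrate how a flag can get lost in translation, here is a hypothetical sketch of a whitelist-style mapping from host open flags to wire flags. It is not the code of gf_flags_to_flags or of any Gluster xlator; the `TOY_WIRE_*` values and `toy_host_to_wire` are invented for this example, and the `O_PATH` fallback value is an assumption for the generic Linux ABI. The point is only that a translator which maps known flags one by one silently drops anything it has no mapping for, and an O_PATH open then looks like a plain read-only open.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes O_PATH only when this is defined */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value */
#endif

/* Hypothetical wire encoding, invented for this sketch. */
enum {
    TOY_WIRE_WRONLY = 1u << 0,
    TOY_WIRE_RDWR   = 1u << 1,
    TOY_WIRE_CREAT  = 1u << 2,
    TOY_WIRE_PATH   = 1u << 3
};

/*
 * Translate host flags to wire flags using an explicit whitelist.
 * `maps_opath` selects whether the translator has an O_PATH mapping.
 */
unsigned toy_host_to_wire(int flags, bool maps_opath)
{
    unsigned wire = 0;

    if ((flags & O_ACCMODE) == O_WRONLY)
        wire |= TOY_WIRE_WRONLY;
    if ((flags & O_ACCMODE) == O_RDWR)
        wire |= TOY_WIRE_RDWR;
    if (flags & O_CREAT)
        wire |= TOY_WIRE_CREAT;
    if (maps_opath && (flags & O_PATH))
        wire |= TOY_WIRE_PATH;

    /* Any host flag without a branch above is dropped here. */
    return wire;
}
```

Without the mapping, `toy_host_to_wire(O_PATH, false)` yields 0, the same value as a read-only open, which matches the symptom of reads being allowed on the brick side.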

mohit84 avatar Mar 21 '22 02:03 mohit84

I think server_xlator also filters the flag(O_PATH) by the function (gf_flags_to_flags).

mohit84 avatar Mar 21 '22 04:03 mohit84

I think server_xlator filters the flag(O_PATH) by the function (gf_flags_to_flags).

Yes, we also need to make some changes there, but they are very simple. The main issue is the lack of an actual open call from kernel for entries opened with O_PATH.

xhernandez avatar Mar 21 '22 07:03 xhernandez

The main issue is the lack of an actual open call from kernel for entries opened with O_PATH.

I presume all those things needs to be done in any case if we need to support O_PATH with gfapi ?

amarts avatar Mar 21 '22 07:03 amarts

The main issue is the lack of an actual open call from kernel for entries opened with O_PATH.

I presume all those things needs to be done in any case if we need to support O_PATH with gfapi ?

In the case of gfapi, what's expected is not defined. If we assume that gfapi clients will behave like the kernel, then the only thing we'll see is a lookup. However, I think we have more margin here to require that O_PATH opens be sent. This would reduce the problem (or it could even work after some fixes), but it would create two issues IMO:

  1. Inconsistency between FUSE and gfapi behaviors. This will lead to other problems sooner or later.
  2. It doesn't take advantage of the main reason why O_PATH is used: it's a performance improvement because it doesn't require actually processing the full open request. If we still process O_PATH opens down to the bricks and the posix layer of Gluster, there's no point in using it; we could simply ignore the flag and just make some additional checks for reads and writes.
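As a reference for what those "additional checks for reads and writes" would have to reproduce, this sketch shows the behavior on a local Linux filesystem: read() on an O_PATH descriptor fails with EBADF. The helper name `read_errno_on_opath_fd` is made up, and the `O_PATH` fallback value is an assumption for the generic Linux ABI.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes O_PATH only when this is defined */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value */
#endif

/*
 * Hypothetical helper: create `path`, open it with O_PATH and try to read
 * from the descriptor. The file is removed again before returning.
 * Returns the errno set by read(), 0 if read() unexpectedly succeeded,
 * or -1 if the setup steps failed.
 */
int read_errno_on_opath_fd(const char *path)
{
    char buf[1];
    ssize_t n;
    int err;
    int fd;

    assert(path != NULL);

    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    fd = open(path, O_PATH);
    if (fd < 0) {
        unlink(path);
        return -1;
    }

    /* An O_PATH descriptor only names the file; it cannot be used for I/O. */
    n = read(fd, buf, sizeof(buf));
    err = (n < 0) ? errno : 0;

    close(fd);
    unlink(path);
    return err;
}
```

Running the same helper against a path on a gfapi-backed or FUSE-backed volume would show whether the bricks enforce the restriction.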

xhernandez avatar Mar 21 '22 07:03 xhernandez

Summary of my discussions with @xhernandez and path ahead:


Why was it not done before?

When we first implemented open(), we took only the flags handled in the fuse layer, and O_PATH is not handled there either, mainly because the kernel itself maps an O_PATH open to a lookup() with an ‘nlookup’ increase.

Why do we need O_PATH?

For consistency (and, as a result, better caching), many applications running on a filesystem use a file descriptor (fd) opened with O_PATH in ‘openat()’, ‘mkdirat()’, etc. (i.e., all ‘${fop}at()’ calls). This protects them against the path being altered while they are operating on lower nodes of the path tree. For example, applications/services like smbd use these ‘at()’ calls in their vfs fsal layers.
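A small sketch of that pattern on a local Linux filesystem: a directory is opened once with O_PATH and the descriptor is then used as the anchor for ‘at()’ calls, so later renames of the parent path do not affect them. The helper name `create_via_opath_dirfd` is made up, and the `O_PATH` fallback value is an assumption for the generic Linux ABI.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes O_PATH only when this is defined */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value */
#endif

/*
 * Hypothetical helper: create, stat and remove `name` inside `dir`, using an
 * O_PATH directory descriptor for every step instead of re-resolving `dir`.
 * Returns 0 on success, otherwise the errno of the first failing step.
 */
int create_via_opath_dirfd(const char *dir, const char *name)
{
    struct stat st;
    int err = 0;
    int dirfd;
    int fd;

    assert(dir != NULL && name != NULL);

    /* The directory is resolved once; the fd pins that exact directory. */
    dirfd = open(dir, O_PATH | O_DIRECTORY);
    if (dirfd < 0)
        return errno;

    fd = openat(dirfd, name, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) {
        err = errno;
        close(dirfd);
        return err;
    }
    close(fd);

    if (fstatat(dirfd, name, &st, 0) < 0)
        err = errno;
    if (unlinkat(dirfd, name, 0) < 0 && err == 0)
        err = errno;

    close(dirfd);
    return err;
}
```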

While a glusterfs open() fop originating from the fuse layer may not carry O_PATH, the flag may be present in an open originating from glfs_open().

How to get this implemented?

A proper implementation of O_PATH in glusterfs that is consistent across both fuse and libgfapi is an effort that touches how we manage inode references today. More details on this follow in the later part of this section.

NOTE: Whichever way we support O_PATH in glusterfs, it will require changes in the protocol layer (i.e., in XDR, and maybe in how xdata’s fields are interpreted). Thus, the O_PATH feature will be supported only when both client and server are at certain versions.

Changes for O_PATH

It would be good to implement it in phases IMO.

Part I - Get O_PATH passed to the server/brick process in open() call.

For a moment, ignore the fact that we don’t receive O_PATH in fuse, and treat glusterfs’s protocol as if it receives O_PATH. Today, O_PATH is not handled in glusterfs’s protocol layer, and just handling it there should make glusterfs’s open() more POSIX compliant.

This can be demonstrated with a test that uses glfs_open() / glfs_unlink() / glfs_fstat() calls.

The PR for this can include that test case to get started.

Part II - Pass the client’s ‘nlookup’ from its inode table to the server side, and pass the same in forget too.

This is a PR of its own, which will need more testing for reference leaks. But the idea is simple: if any client mount has a ‘reference’ on the inode, the server brick should also hold a reference on the same file. This by itself will bring consistency.
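The accounting idea can be sketched as follows. This is a toy model with invented names (`struct toy_brick_inode`, `toy_brick_lookup`, `toy_brick_forget`), not a proposal for the actual protocol or data structures. It assumes the forget message carries a count, in the same way a FUSE forget carries the number of lookups being dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical brick-side mirror of the references held by clients. */
struct toy_brick_inode {
    uint64_t nlookup; /* lookups reported by clients and not yet forgotten */
};

/* A client looked the inode up: the brick takes a matching reference. */
void toy_brick_lookup(struct toy_brick_inode *in)
{
    assert(in != NULL);
    in->nlookup++;
}

/*
 * A client forgot `count` lookups. The count is clamped so that a buggy
 * client cannot make the counter wrap around.
 * Returns true when no client reference is left on the brick.
 */
bool toy_brick_forget(struct toy_brick_inode *in, uint64_t count)
{
    assert(in != NULL);
    in->nlookup = (count >= in->nlookup) ? 0 : in->nlookup - count;
    return in->nlookup == 0;
}
```

In this model an unlinked file could be removed for good only once `toy_brick_forget()` returns true and no fds remain open, which is where reference leaks would show up as files that never go away.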

Part III - Make server inode pruning handle nlookup properly

The server’s inode pruning should send an invalidation to the client, and the server should forget the inode only if the client responds with a forget(); otherwise, it should keep the reference intact.

Implementing ‘at()’ calls in glfs.

This part can start as soon as ‘Part I’ above is completed, so a proper test case can be added for it as well.

This will help vfs_glusterfs.c in smb/source3 and smb/source4 work smoothly with glusterfs.


Posting this here so that work on it can start. More updates will be given along with the PR and/or here.

amarts avatar Apr 03 '22 07:04 amarts

Would be nice to get all those new gfapi calls implemented mentioned in the release 11 tracker.

mykaul avatar Aug 16 '22 12:08 mykaul

Is that done and complete, for Gluster 11?

mykaul avatar Feb 16 '23 08:02 mykaul

O_PATH support for gfapi should work, though it's not optimized, but FUSE mounts won't support it until #3812 is addressed.

xhernandez avatar Feb 16 '23 10:02 xhernandez

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Sep 17 '23 06:09 stale[bot]

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Apr 26 '25 04:04 stale[bot]

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

stale[bot] avatar Jun 27 '25 03:06 stale[bot]