
Fix inconsistencies caused by the current inode lifecycle management

xhernandez opened this issue · 9 comments

Description of problem

Gluster has had some consistency issues when a file is being deleted while some other operation is being executed on its inode at the same time. This has caused inconsistencies, mostly on bricks of replicated and dispersed volumes. There have been several attempts to fix them, and the situation has improved over time. However, we have reached a point where some of the remaining problems cannot be solved with the current design.

The main problem arises from an artificial dependency between inodes and file names: names are just data written into a special file called a directory. This data contains an inode id that references an inode, but it is not the inode itself. The inode only keeps a count of the names pointing to it (the number of hardlinks) to determine when it's no longer referenced by any directory entry.

This means that when we delete a file we are just updating the contents of the directory inode; the inode referenced by the file lives somewhere else and can continue to exist. This makes it harder to synchronize updates on the inode with updates on the directory, because they are two different things.

Since POSIX requires that a deleted file remain accessible through an already open file descriptor, Gluster implements this approach:

  • When the posix xlator receives a delete request for the last hardlink to an inode, it checks the number of open file descriptors on that file. If there are none, the file is simply deleted from the backend, under the assumption that without open file descriptors or directory entries pointing to it, no other operations will target that inode.

However, that assumption is not correct. While the delete operation is executed, other operations on the inode could be in flight. Depending on the order in which the bricks process them, these operations may or may not succeed (because the inode will be gone). For volumes that keep multiple bricks synchronized, like replicated and dispersed ones, that's a problem. If we ignore the overhead of self-heal we could just say that we take the most common result and fix the others if necessary. That's not an efficient solution, but even then it's still incorrect.
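For illustration, this is roughly the decision being made on the brick; a simplified sketch only, where every name is hypothetical and not the actual posix xlator code:

```c
/* Illustrative sketch only: how the brick currently decides whether to
 * remove the backend file on the last unlink. All names are hypothetical,
 * not the actual posix xlator code. */
#include <stdbool.h>

struct bk_inode {
    int nlink;      /* directory entries pointing to the file */
    int open_fds;   /* file descriptors currently open on it  */
};

/* Returns true when the backend file can be removed immediately. */
static bool
can_delete_backend(const struct bk_inode *inode, bool is_last_hardlink)
{
    /* No names left and no open fds: assume nothing else will ever use
     * this inode. This is exactly the assumption the race below breaks,
     * because a client may still have fops in flight for it. */
    return is_last_hardlink && inode->open_fds == 0;
}
```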

For example, suppose we have two clients doing this in a loop:

  • Client 1: touch a; mv -f a b
  • Client 2: stat b

A rename atomically replaces the target file, so once b is created, the stat on the other client should never fail: the file doesn't really disappear, the name just keeps changing the inode it points to. However, the kernel doesn't use paths for many filesystem operations. It translates the path into an inode id using a special operation called lookup, and then sends requests passing only the inode id (the reason for this is out of the scope of this issue).

What's the problem here ?

The kernel on client 2 sends a lookup and gets an inode id for the existing entry 'b'. Then it sends the stat request using that inode id. In the meantime, however, the other client has created a completely new inode and updated file 'b' to point to it. The previous inode is gone because no one had it open, so when the stat request arrives at the bricks the inode doesn't exist anymore and an error is returned (probably ESTALE) that shouldn't have happened.
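A small reproducer sketch of this race, under the assumption that the "writer" runs on one client mount and the "reader" on another (paths and the `.tmp` name are just examples):

```c
/* Sketch of the race described above. Build and run as, for example:
 *   ./race writer /mnt/gluster1/b    (client 1, equivalent to touch + mv -f)
 *   ./race reader /mnt/gluster2/b    (client 2, equivalent to stat b)
 * With the current inode lifecycle, the reader's stat() may fail with
 * ESTALE even though 'b' always exists once it has been created. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s writer|reader <path>\n", argv[0]);
        return 1;
    }

    const char *path = argv[2];
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    if (strcmp(argv[1], "writer") == 0) {
        for (;;) {
            int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd >= 0)
                close(fd);
            /* Atomically replace the target, like 'mv -f a b'. */
            rename(tmp, path);
        }
    } else {
        struct stat st;
        for (;;) {
            /* Once 'b' exists, POSIX says this should never fail. */
            if (stat(path, &st) != 0)
                perror("stat");
        }
    }

    return 0;
}
```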

We could improve this by adding locks that prevent those cases. A rename would need to be blocked until all operations on 'b' are completed, and while the rename is running, all operations on 'b' would also need to be blocked (note that this includes lookups, which currently work without locks). This would cause a serious performance issue just to fix a case that happens quite infrequently, but that nonetheless needs to be fixed to be POSIX-compliant. In any case, even if we did this, it would still not work in all cases.

The implementation of O_PATH has introduced a use case that cannot be fixed at all with the current approach, even using locks (check issue #2717). Additionally, some of the modifications introduced to try to address or minimize the inconsistencies between deletes and other operations are creating deadlocks in some cases (check issue #3677).

What's the real problem ?

The main issue here is that the lifecycle of an inode in the kernel is different from the lifecycle of an inode in Gluster. All kernel filesystems keep inodes "alive" independently of the directory entries and the open file descriptors. They track the lifecycle with the number of references to the inode (anything using the inode takes a reference on it; this is independent of the hardlinks). When the last reference to an inode is released, the kernel checks the number of hardlinks: if it's 0, the inode is physically deleted; if not, the inode is kept, even if it doesn't have any name pointing to it.

Since any operation running on an inode holds a reference to it, those operations can never fail with ESTALE because the inode is guaranteed to still be there. With this approach the case explained before works correctly. It's possible that 'stat' returns data from an inode no longer pointed to by 'b', but it won't fail, as POSIX requires. Once the number of references drops to 0, we are sure no more operations will happen on the inode (otherwise they would hold a reference), so the inode can be safely deleted if there are no hardlinks.
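A minimal sketch of this lifecycle rule (the structure and helper names are illustrative, not actual kernel code):

```c
/* Illustrative only: refcount-driven inode lifecycle, as kernel
 * filesystems do it. All names are hypothetical. */
struct k_inode {
    int refcount;   /* anything using the inode: fds, in-flight ops, dcache */
    int nlink;      /* names pointing to the inode (hardlinks)              */
};

void destroy_inode(struct k_inode *inode);   /* physically remove it    */
void keep_in_cache(struct k_inode *inode);   /* still reachable: keep it */

/* Drop one reference; only the last holder ever deletes the inode. */
void
inode_put(struct k_inode *inode)
{
    if (--inode->refcount > 0)
        return;   /* someone is still using it: it must stay alive */

    if (inode->nlink == 0)
        destroy_inode(inode);   /* unlinked and unused: delete for real */
    else
        keep_in_cache(inode);   /* still has a name: keep it around     */
}
```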

This approach would fix the inconsistencies caused by tests similar to the one explained, as well as the deadlock issue (because we can remove the special code that caused the deadlock in the first place). It will also make it possible to correctly implement support for O_PATH.

A side effect of the independence of inodes and names

Since Gluster has always considered inodes and names tightly bound, it commonly provides the equivalent path of an inode in multiple operations. This is done mostly in the fuse xlator, where an inode is "translated" into a path in many cases, but other xlators do similar things in some cases.

As we have seen before, this is basically wrong. At the time we receive an inode from the kernel, the original path pointing to that inode may not exist anymore, or may even point to another inode. So what's the purpose of that path ? In the best case it won't be used for anything critical (maybe just some logs), but in the worst case it could be used to perform some path-based operation, which could modify a different inode than the intended one.

Scope of the changes required

Because of these two critical problems, we need to change the entire inode management. There are several things we require:

  • Make the reference counting strict and global.

    Right now, reference counting inside a process should be strict enough. However, we track remote references loosely. For example, bricks are not fully aware of the references held by clients: they just keep a pool of inodes in an LRU and reuse them when necessary. This is insufficient to implement deletes based on reference counting, so it will need to be modified. It's still not clear whether this will require network protocol changes.

  • Invalidations need to be consistent.

    Right now, when the fuse kernel module sends a forget, we notify all xlators that the inode has been released by the kernel. This means that all references to the inode held by xlators should be dropped, along with any cached data, and the invalidation propagated to the brick. The way to do this correctly is by sending the notification similarly to a fop: sending it to the top xlator and waiting for a confirmation, or callback. We currently just call all xlators sequentially, in no specific order. This means that if some xlator needs to complete a pending operation before releasing all references to an inode, it will delay the unrefs. However, its children may have already processed the forget and returned. In this case, when the pending operations are processed, new references will be acquired, but there won't be any further forget to force that xlator to release its references, so we may end up keeping unneeded inode references.

    This is also applicable to invalidations travelling from the brick all the way to the kernel on the client side.

  • We need to remove all path references for inode-based operations.

    If an operation is based on an inode, there shouldn't be any need to reconstruct its path, since there's no guarantee that it is still valid. We should be able to complete the operation without it, as kernel filesystems do.

To do this, we need, at least, the following changes:

  • In the loc_t structure, used to pass inodes along with their parent inode and path, we need to remove the path, as it will never be safe. For entry-based operations we only require the parent inode and the name. For inode-based operations the entire loc_t can be replaced by a single inode_t.

  • The caching of dentries in the core of the inode management needs to be removed, which means that everything depending on it must also be removed. This basically affects the inode_path() function, which is widely used by many xlators.

  • Convert forget requests from the kernel into fop-like operations, so that each xlator can fully control when its descendants can safely forget the inode (see the sketch after this list).

  • Make invalidations consistent. We need to track an invalidation from the moment it's issued until the kernel receives it and eventually forgets the inode, so that we can precisely follow the inode lifecycle and remove all references when necessary, neither too early nor too late.

  • We need to forward forgets and invalidations correctly between kernel, clients and bricks.
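As a rough illustration of the "forget as a fop" idea from the list above: the notification would be wound down the stack like any other fop, and each xlator would only answer its parent once its own pending work and its children are done. This is a sketch only; it assumes the libglusterfs headers are available as <glusterfs/xlator.h>, and every helper name below is hypothetical:

```c
/* Sketch only: a forget that behaves like a fop. xlator_t, inode_t and
 * FIRST_CHILD are the existing Gluster types/macros; everything else is
 * a hypothetical helper this xlator would implement. */
#include <glusterfs/xlator.h>

typedef void (*forget_cbk_t)(void *cookie, int ret);

static int has_pending_work(xlator_t *this, inode_t *inode);
static void queue_forget(xlator_t *this, inode_t *inode,
                         forget_cbk_t cbk, void *cookie);
static void drop_local_state(xlator_t *this, inode_t *inode);
static void child_forget(xlator_t *child, inode_t *inode,
                         forget_cbk_t cbk, void *cookie);

static void
my_forget(xlator_t *this, inode_t *inode, forget_cbk_t cbk, void *cookie)
{
    if (has_pending_work(this, inode)) {
        /* Keep our references and answer only after the pending
         * operations finish, exactly as a fop would be delayed. */
        queue_forget(this, inode, cbk, cookie);
        return;
    }

    drop_local_state(this, inode);

    /* Wind down to the child; our parent gets the callback only when
     * the whole subtree below us has forgotten the inode. */
    child_forget(FIRST_CHILD(this), inode, cbk, cookie);
}
```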

Proposal

As we can see, the required changes are big and spread across roughly all xlators. From my experience, handling loc_t structures has always been a headache. There's no specification about what can be expected to be present in a loc_t structure for each fop. Sometimes we just have an inode, sometimes we also have a path, sometimes just a gfid, or any combination of them. Also, the inode->gfid may not be equal to the gfid present in the loc_t. Even worse, depending on which xlator has created the loc_t, its contents may be different, even for the same fop. This means that correctly handling loc_t's is not easy, and each xlator has implemented it in a slightly different way, which contributes to the problem.

Since most of the information we keep in a loc_t is not really necessary (or is even problematic, based on the previous explanation) and handling these structures is not easy, I propose to replace them with just the data we need (a minimal sketch follows the list):

  • An inode for inode-based operations
  • An inode + name for entry-based operations
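A minimal sketch of what those two forms could look like. The structure names are hypothetical; only inode_t is an existing type, assumed to come from <glusterfs/inode.h>:

```c
/* Hypothetical replacements for loc_t: only what each kind of fop needs. */
#include <glusterfs/inode.h>

/* Inode-based operations (stat, setattr, readv, ...): just the inode. */
typedef struct {
    inode_t *inode;
} gf_inode_args_t;

/* Entry-based operations (lookup, create, unlink, rename, ...): the
 * parent inode plus the entry name inside that parent. No path and no
 * redundant gfid fields. */
typedef struct {
    inode_t *parent;
    const char *name;
} gf_entry_args_t;
```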

That would simplify a lot of things, but it requires changing the function signatures of almost all fops, which in turn requires changing many other helper functions that have similar signatures.

This wouldn't be the first time we change the fop signatures, but doing it is a huge change, so we have always tried to avoid it. If changing function signatures were easier, we would probably have done it several times already, mostly to implement small changes or new functionalities that don't really require using xdata (but it was the only thing available without doing big changes).

For this reason I think we should take advantage of this opportunity to simplify function signatures by receiving just a generic struct that contains the required fields for each fop, so that the signature of all fops becomes the same. This way, modifying the arguments required by a fop in the future will only require changing the structure and the places where it's used, while all function signatures remain the same. This can also be used to convert many macros that handle variable argument lists into functions, which are cleaner and easier to debug.
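A hedged sketch of what such a uniform signature could look like; gf_fop_args_t and gf_fop_fn_t are made-up names, while the field types are the existing Gluster types assumed to be available through <glusterfs/xlator.h>:

```c
/* One structure carries every possible fop argument; each fop only uses
 * the fields that apply to it. All names here are hypothetical. */
#include <glusterfs/xlator.h>

typedef struct {
    glusterfs_fop_t fop;   /* which operation this is            */
    inode_t *inode;        /* inode-based fops                   */
    inode_t *parent;       /* entry-based fops                   */
    const char *name;      /* entry-based fops                   */
    fd_t *fd;              /* fd-based fops                      */
    off_t offset;          /* readv/writev/ftruncate/...         */
    size_t size;
    int32_t flags;
    dict_t *xdata;         /* fop-specific extra data, as today  */
} gf_fop_args_t;

/* The same signature for every fop of every xlator: adding an argument
 * later only touches gf_fop_args_t and the fops that care about it. */
typedef int32_t (*gf_fop_fn_t)(xlator_t *this, gf_fop_args_t *args);
```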

The change is going to be big, but most probably of the same order of magnitude as just replacing loc_t.

By removing the dentries we greatly simplify internal inode management, probably to the point that the inode_link() function becomes trivial. In that case we can use it directly at the bottom of the stack (for example in protocol/client) instead of at the top level xlator (mount/fuse). This also contributes to making inodes more "stable", since inode->gfid will always contain correct data (to fully guarantee that, we also need to change some function signatures and pass inodes only when really necessary, but this should be a trivial change that could be made as part of the previous signature change).

Using a common structure that contains the arguments for all fops, we can even provide some helper functions that simplify the most common operations each xlator currently does to propagate requests to its children. Since we will need to rewrite things because of the function signature changes anyway, these helpers may make those changes easier.

The simplification of the internal inode management, combined with the change in its lifecycle and the new requirement of strictly managing references globally, will also require significant changes. This can be used to rethink how we handle reference counting. The most important point would be to make reference counting independent of the inode table's lock. I think this is a relatively small change compared to everything else, but it could provide important benefits.
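One common way to decouple reference counting from the table lock is to use atomic counters and take the table lock only on the final 0 transition. A purely illustrative sketch, with hypothetical names (this is not the current inode table code):

```c
/* Illustrative only: refs/unrefs never touch the table lock on the fast
 * path; only the last unref does, to remove the inode from the table. */
#include <pthread.h>
#include <stdatomic.h>

struct my_table {
    pthread_mutex_t lock;   /* protects only the table's lists/hash */
};

struct my_inode {
    atomic_int refcount;
    struct my_table *table;
};

void table_remove_and_destroy(struct my_inode *inode);   /* hypothetical */

static inline void
my_inode_ref(struct my_inode *inode)
{
    atomic_fetch_add(&inode->refcount, 1);   /* no table lock needed */
}

static void
my_inode_unref(struct my_inode *inode)
{
    if (atomic_fetch_sub(&inode->refcount, 1) != 1)
        return;   /* other references remain */

    /* Only the 0 transition takes the table lock, and it re-checks the
     * count in case a concurrent lookup resurrected the inode. */
    pthread_mutex_lock(&inode->table->lock);
    if (atomic_load(&inode->refcount) == 0)
        table_remove_and_destroy(inode);
    pthread_mutex_unlock(&inode->table->lock);
}
```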

This is the summary of the changes proposed (not necessarily in the order they need to be implemented):

  • Define a new structure to carry all the required data for each fop.
  • Provide helper methods for fop management (wind, unwind, ...).
  • Change function signatures and code to use data from the new structure.
  • Implement new xlator_api to load xlators using the new signatures.
  • Remove dentry_t and all its dependencies.
  • Change xlators depending on paths (i.e. anything indirectly depending on dentry_t).
  • Implement forgets and invalidations like fops.
  • Make bricks aware of external references (from clients) and correctly handle inode lifecycle at posix layer.
  • Change inode reference counting to make it more lightweight (remove dependency on inode table's lock).

I've probably missed something, but I think that's enough to give a good understanding of the reasons for this change, its implications, and its scope.

Thanks for reading this loooooong issue 🙄

xhernandez · Sep 13 '22 15:09

@amarts @pranithk @mohit84 @gd @mykaul @sunilheggodu

I would like to know your point of view and start the discussion on what's the best way to fix the issues. Do you think there's any alternative way to solve the problems ?

xhernandez · Sep 13 '22 15:09

How do we handle backward compatibility and upgrade?

mykaul · Sep 13 '22 15:09

How do we handle backward compatibility and upgrade?

I've thought about it, but I don't have enough detail about the required changes to know the implications for upgrade and backward compatibility.

The good thing is that most of the changes are big but contained within a single process. All the signature changes are completely transparent outside the process, so clients and servers of different versions should work fine together.

The inode reference counting changes have more dependencies. They won't work correctly unless both sides run the new version, but we need to see if we can make them work well enough to at least not cause problems during upgrades. This will need further work once we have a deeper understanding of the required changes.

xhernandez · Sep 13 '22 21:09

@amarts @pranithk @mohit84 @gd @mykaul @sunilheggodu

I would like to know your point of view and start the discussion on what's the best way to fix the issues. Do you think there's any alternative way to solve the problems ?

Thanks Xavi for the detailed description. I think that after taking references on the brick side we would be able to solve multiple consistency problems on the client side, specific to read/access-based fops, but the challenge is to maintain consistency because it will be a network operation. As we know there are issues with lock xlators; I am not sure how we will maintain consistency during a disconnect. The other thing is that it will hurt performance significantly if we go with strong consistency. Is it not possible to first implement a server-based inode reference mechanism instead of doing big changes?

mohit84 · Sep 14 '22 05:09

Thanks Xavi for the detailed description. I think that after taking references on the brick side we would be able to solve multiple consistency problems on the client side, specific to read/access-based fops, but the challenge is to maintain consistency because it will be a network operation.

We can think more about this, but I would say that we can do the same thing we do now: when a brick disconnects, any reference it was holding is released. This might mean that an inode gets deleted and won't be present when the client reconnects. However, if this is a replicated or dispersed volume, self-heal should recreate it. As an optimization we could add a grace period before deleting inodes after a disconnection, to avoid self-heal for short disconnects, but from the functional point of view I don't see any issue.

As we know there are issues with lock xlators, I am not sure how we will maintain consistency during disconnect.

Do you refer to the deadlocks or the posix lock ?

Deadlocks will be fixed because the code that causes them will be removed after these changes (it's not needed).

For posix locks on reconnects we will have the same problems we have now. I don't see how this change could make it worse.

Can you explain a bit more what problem you see ?

The other thing is it will hurt performance significantly if we go with strong consistency. Is it not possible first we can implement a server-based inode reference mechanism instead of doing big changes?

The change needs to be done both in the client and the server, otherwise things like O_PATH won't work. Also, without the client side correctly managing references, the brick won't really know whether an inode is still in use or not, so it's of little use.

The increased consistency is a consequence of the better management of inode references. These changes don't enforce extra consistency by using locks in different ways, so there shouldn't be a performance drop because of this.

This is different from the changes needed for a consistent cache. In that case new locks would most probably be required, which can hurt performance, but at the same time more requests could be served locally from the client, so it would improve performance. But this change is not related to that. I don't expect any change in performance right now.

xhernandez · Sep 14 '22 06:09

Do you refer to the deadlocks or the posix lock ?

Yes, I am referring to the deadlock issue in the lock xlator.

The change needs to be done both in the client and the server, otherwise things like O_PATH won't work. Also, without the client side correctly managing references, the brick won't really know whether an inode is still in use or not, so it's of little use.

Yes, I mean to say we can first implement reference management on both sides, since our current problem will be solved after this, right?

This is different from the changes needed for a consistent cache. In that case new locks would most probably be required, which can hurt performance, but at the same time more requests could be served locally from the client, so it would improve performance. But this change is not related to that. I don't expect any change in performance right now.

OK.

mohit84 · Sep 14 '22 06:09

Assuming that we are going to change the fop signatures to pass a common struct instead of different arguments for each fop, I started working in that direction.

Given that the signature will be the same for all fops, many functions can actually be simplified or even removed (for example, most of the functions in defaults.c won't be necessary). This also means that STACK_WIND and STACK_UNWIND don't really need to be macros.

Thinking about it and combining it with some other ideas I'm developing for the I/O framework, I propose to create the following infrastructure:

Contexts

Contexts are generic objects, which can be embedded inside other structures or allocated individually, that are used to pass information between a caller and a callee when the execution of the callee may be asynchronous.

They can also be used to pass information between a function and a callback after the execution of a potentially asynchronous operation in the middle.

Contexts can be used to pass any data to callees and to receive the result of an operation when executed as callbacks. They can also be nested, so that when the processing of one context is completed, the parent context is processed.
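A minimal sketch of what such a context could look like; every name here (gf_context_t and friends) is hypothetical, just to illustrate the idea:

```c
/* Hypothetical context object: it carries a completion callback, a pointer
 * to the parent context, and an opaque cookie for the owner's data. It can
 * be embedded in any per-operation structure. */
#include <stddef.h>
#include <stdint.h>

typedef struct gf_context gf_context_t;

typedef void (*gf_context_cbk_t)(gf_context_t *ctx, int32_t ret);

struct gf_context {
    gf_context_cbk_t cbk;   /* called when this context completes        */
    gf_context_t *parent;   /* processed next, once this one is done     */
    void *cookie;           /* owner's private data                      */
};

/* Completing a context runs its callback and then walks up to the parent.
 * The parent pointer is read before the callback, since the callback may
 * free the structure that embeds the context. */
static void
gf_context_complete(gf_context_t *ctx, int32_t ret)
{
    while (ctx != NULL) {
        gf_context_t *parent = ctx->parent;
        if (ctx->cbk != NULL)
            ctx->cbk(ctx, ret);
        ctx = parent;
    }
}
```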

Schedulers

By themselves, contexts are just a special structure. The schedulers take contexts, plus information from the caller (such as whether it still has work that needs to be processed as soon as possible, or whether it has nothing else to do, ...), to decide when and where (in which thread) the context will be processed. They can be used to execute some operations asynchronously in the background or with a specified priority.

This concept is basically borrowed from the I/O framework, and it requires other functionalities to be fully operational, but I think it's the right time to introduce it. Without the missing functionalities the only scheduler that can be implemented is one that executes the functions synchronously, but this is exactly what we currently do with STACK_WIND and STACK_UNWIND.

Having the infrastructure there will make it possible to use the additional schedulers quite easily when they are available.
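Building on the context sketch above, the scheduler interface and the only scheduler possible today (synchronous execution) could look roughly like this; again, all names are hypothetical:

```c
/* Hypothetical scheduler interface: it decides when and in which thread a
 * context gets processed. Reuses gf_context_t from the previous sketch. */
typedef struct gf_scheduler {
    void (*dispatch)(struct gf_scheduler *sched, gf_context_t *ctx,
                     int32_t ret);
} gf_scheduler_t;

/* Synchronous scheduler: process the context immediately in the calling
 * thread. Functionally equivalent to today's STACK_WIND/STACK_UNWIND. */
static void
sync_dispatch(gf_scheduler_t *sched, gf_context_t *ctx, int32_t ret)
{
    (void)sched;                    /* no queues, no threads involved */
    gf_context_complete(ctx, ret);
}

static gf_scheduler_t sync_scheduler = { .dispatch = sync_dispatch };
```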

Call and Return

We currently use STACK_WIND and STACK_UNWIND to call and return from fop functions between xlators. Since these macros are not required anymore, we can replace them with special functions that just schedule a call or return to the appropriate function or callback.

Using contexts we don't really need a stack or frames. Xlators will just allocate the private structure they require to handle the operation (commonly stored in frame->local in the current implementation) and pass it to the callee or callback using contexts (which can be embedded in the structure). This means that we'll need fewer than half the memory allocations we currently do while processing a fop.
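A toy example of the embedding, reusing the gf_context_t sketched earlier; the 'foo' xlator and all its fields are made up:

```c
/* Hypothetical per-fop state for some xlator 'foo': a single allocation
 * that holds both the xlator's private data and the embedded context used
 * to call the child and receive the answer. No frame, no separate local. */
typedef struct {
    gf_context_t ctx;     /* embedded, must be the first member */
    int32_t op_errno;     /* whatever the xlator needs to remember */
} foo_stat_state_t;

static void
foo_stat_cbk(gf_context_t *ctx, int32_t ret)
{
    /* Recover the enclosing structure from the embedded context (valid
     * because ctx is the first member of foo_stat_state_t). */
    foo_stat_state_t *state = (foo_stat_state_t *)ctx;

    /* ... post-process the answer, then complete the parent context ... */
    (void)state;
    (void)ret;
}
```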

Fop and Cbk structures

The final component is a structure that will contain all required arguments for each fop. It will basically have enough fields of each type to satisfy the requirements of each fop. Each function will know which ones to use based on what fop it's handling.

Since many xlators just pass the arguments "as is", the contents of these structures can be reused, which means that we can easily avoid many data copies and refs when passing data from one xlator to another.

In case we need to change the arguments of some fop, we just need to change the functions that manage those changed arguments. Everything else will be able to pass the new argument inside the structure in a transparent way without requiring extra changes.


This is a high-level overview of the changes I'm planning to make to modify the fop signatures. Let me know if you have any comments or concerns.

xhernandez · Sep 20 '22 12:09

I've been trying to find a good way to correctly invalidate inodes through fuse to guarantee that cached data is gone. This is necessary to enforce consistency and make sure that modifications acknowledged on one client are not hidden on another client by stale cached data. However, I'm not able to fully implement it.

Based on my tests, FUSE_NOTIFY_INVAL_INODE invalidates the inode's data and metadata, but the inode itself remains in the cache (no FORGET is sent). The only difference is that a new GETATTR request is sent when the inode is needed again. When FUSE_NOTIFY_INVAL_ENTRY is sent (FUSE_NOTIFY_DELETE behaves the same way), the entry is removed from the cache and any future attempt to access it will cause a new LOOKUP request, but no FORGET is sent in this case either. Even combining both notifications (invalidating the entry and the inode), I can't force fuse to fully remove the inode and send a FORGET.
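For reference, this is roughly what was tested. Gluster's fuse xlator writes these notifications to /dev/fuse directly, but I'm expressing the attempts here with libfuse's low-level API just for clarity; the helper function is hypothetical and error handling is omitted:

```c
/* Sketch of the invalidation attempts described above, using the libfuse
 * low-level notification API. 'se', 'parent', 'ino' and 'name' are
 * whatever the filesystem tracks for the entry being invalidated. */
#define FUSE_USE_VERSION 34
#include <fuse_lowlevel.h>
#include <string.h>

static void
invalidate_entry_and_inode(struct fuse_session *se, fuse_ino_t parent,
                           const char *name, fuse_ino_t ino)
{
    /* Drops cached data/metadata of the inode; the kernel sends a new
     * GETATTR when it needs the attributes again, but no FORGET. */
    fuse_lowlevel_notify_inval_inode(se, ino, 0, 0);

    /* Drops the dentry; the next access triggers a new LOOKUP, but again
     * no FORGET is sent for the inode it pointed to. */
    fuse_lowlevel_notify_inval_entry(se, parent, name, strlen(name));
}
```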

Without a FORGET message it's hard to determine when the kernel has really flushed all pending data after invalidating it, so it's hard to implement full consistency and clean up resources.

Maybe I've missed something; I've done several tests and even looked at the kernel fuse code, but I can't fully understand why the FORGET is not sent, or how to guarantee consistency without it.

The only way I've found to get a FORGET (without forcing drop_caches or similar methods) is to delete an entry, send the invalidations, and then recreate it with another inode. In this case, when the directory is read, fuse sends a FORGET for the old inode. However, this doesn't seem useful and it's outside the control of the fuse filesystem. It seems like the fuse kernel module keeps some additional reference that is not released by the invalidation, preventing the FORGET message.

@csabahenk do you have any idea ?

xhernandez · Mar 08 '23 17:03

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] · Oct 15 '23 13:10