
Lamination and consistency model

Open sandrain opened this issue 4 years ago • 10 comments

This week, I've been working on the new metadata implementation (I will just call it margotree), which will replace our current mdhim. And it looks like we need to establish our consistency model clearly.

A notable difference between the margotree metadata and mdhim is how the metadata is distributed across compute nodes. With mdhim, the UnifyFS server does not have any dedicated data structures for the file metadata but simply relies on mdhim to store or retrieve them. Therefore, when the UnifyFS client library sends the metadata (upon sync()), the server simply puts the records onto the kvstore (mdhim).

However, this single-step process becomes a two-step process with the margotree metadata. Specifically, in the margotree metadata, the UnifyFS server itself has a new data structure, i.e., unifyfs_inode, which keeps file metadata including stat and extent information. So, when a server daemon receives the indexing metadata records (or extent map) from clients, it first stores them in its local inode structure (step 1). Then we have to decide when to share the extent tree with the other server daemons (step 2). In the current margotree implementation, both step 1 and step 2 happen on sync(), which is actually triggered when an application calls fsync(). However, according to our documentation, this doesn't seem to be the correct approach.
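To make the two steps concrete, here is a minimal sketch of what they might look like on the server side; all type and helper names below are illustrative, not the actual margotree API:

/* Sketch only; names are illustrative, not the actual margotree API. */

/* Step 1: a server receives extents from one of its local clients and
 * records them in its own unifyfs_inode for that gfid. */
int handle_client_sync(int gfid, const extent_t *extents, size_t count)
{
    struct unifyfs_inode *ino = inode_lookup_or_create(gfid);
    return inode_add_extents(ino, extents, count);  /* update local extent tree */
}

/* Step 2: at some later point (fsync? close? laminate?) the server shares
 * its extent tree with the other server daemons. */
int share_extents(int gfid)
{
    struct unifyfs_inode *ino = inode_lookup_or_create(gfid);
    return broadcast_extents(ino);  /* the open question is when to call this */
}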

Our documentation says that application file operations should happen in two strictly divided write and read phases, with an explicit lamination step between those two phases. A typical example would be:

fd = open(file, O_WRONLY);
write(fd, data, len);
fsync(fd);
close(fd);

chmod(file, 0444);  // laminate the file, no further changes

fd = open(file, O_RDONLY);
read(fd, data, len);
close(fd);

Currently, UnifyFS disregards the file descriptor (fd) passed to fsync() and syncs all file indexing metadata buffered in the client to the metadata storage (mdhim). The documentation says that fsync() sends file indexing information from the client library to the server daemon, but it does not say that any subsequent read operation should correctly fetch remote data. If this is not true, we would need to update the documentation accordingly. And if the documentation is correct as it is now, fsync() feels unnecessary in the above example, because the buffered indexing information is expected to be pushed to the server on close() anyway. The documentation also says that:

If the application process group fails before a file has been laminated, UnifyFS may delete the file.

This implies that fsync() also has nothing to do with persisting the file data.

So, back to the original problem of the margotree: I am wondering what the expected way is for the UnifyFS server daemon to handle extent records from clients. Assuming that we will have a proper fsync() implementation, one reasonable option would be:

  • fsync(): send buffered extent records to the server daemon, which will keep them in its local inode data structure (step 1)
  • close(): synchronize extent information between all servers (step 2)

Or, it could also work like this:

  • fsync(): do nothing
  • close(): send buffered extent records to server (step 1)
  • laminate/chmod(): synchronize between servers (step 2)

I do not have a strong opinion here, but it seems more likely for application developers to miss the lamination/chmod() call than to miss close(). Still, there would also be cases where the explicit lamination is necessary, for instance when an application opens a file multiple times just for writing. We may also think about having custom open flags for lamination, e.g., O_LAMINATE_ON_CLOSE or O_NO_LAMINATE_ON_CLOSE, if this makes sense.
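If we went the custom open flag route, usage from the application side might look something like the snippet below; O_LAMINATE_ON_CLOSE is purely hypothetical, not something UnifyFS defines today:

/* O_LAMINATE_ON_CLOSE is a hypothetical flag, shown only to illustrate the idea */
fd = open(file, O_WRONLY | O_LAMINATE_ON_CLOSE);
write(fd, data, len);
close(fd);  /* implied: push extents to the server (step 1) and laminate (step 2) */

/* no separate chmod(file, 0444) call needed afterwards */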

Anyway, I thought this was a good time to discuss this before moving forward. Please share your thoughts.

sandrain avatar Jun 12 '20 20:06 sandrain

Thanks for the write up, @sandrain . I actually think we should do all of the above and more.

Our documentation was written for what we had hoped would be true. We had hoped we might buffer all writes locally during the pre-lamination phase, then flush all writes to make them visible on laminate, and then not have to deal with any writes post-lamination.

Let's call this "writes visible after laminate".

Then we got our first customer in HDF. The HDF group said two things:

  1. They need stat() to return an accurate file size between open/close calls and before the file is laminated.
  2. They'd prefer to have writes be visible after fsync. That's needed to deal with reading back metadata they may have flushed to disk while still writing the file.

We were able to accomplish both of these by flushing index data to MDHIM on fsync, which made both the file size and writes visible after fsync.

Let's call this "writes visible after fsync".

There is actually a third option we could do for HDF, where we could support an accurate file size for stat after an fsync, but only make writes visible after laminate. In that case, we support requirement (1) but not (2). HDF mentioned they could do refactoring work to avoid requirement (2) but it sounds like (1) is a hard requirement. Also, it sounds like the work for dropping (2) is possible but perhaps not trivial. Anyway, to support that, we'd need the file size allreduce across servers to report the correct value, but we wouldn't have to exchange extents across servers until laminate.

Let's call this third option "file size visible after fsync, writes visible after laminate".
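For what it's worth, the file-size half of that third option boils down to a max-reduction of the largest extent offset each server has seen. A rough sketch, using MPI only as a stand-in for whatever server-to-server exchange we actually use:

/* Sketch: each server contributes the max offset it has recorded locally for
 * the file; a max-allreduce across servers gives the global file size.
 * MPI here is only a stand-in for the real server RPC layer. */
off_t global_file_size(off_t local_max_offset)
{
    long long local  = (long long) local_max_offset;
    long long global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG_LONG, MPI_MAX, MPI_COMM_WORLD);
    return (off_t) global;
}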

Ultimately, I think we should support all three (and more we haven't thought of) and let the user choose. I'm guessing HDF won't be the last customer to ask for "writes visible after fsync", so I think we'll be stuck with supporting that mode. However, if there are apps out there that can accept "writes visible after laminate", then we could likely enable a bunch of optimizations to boost their performance. Supporting both modes lets people start using UnifyFS while their app still requires "writes visible after fsync" but then gives them incentive to change their app to work towards "writes visible after laminate".

With that, we might define:

"writes visible after fsync"

  • fsync: flush write index info across servers
  • laminate: put file into "no more writes" mode

"writes visible after laminate"

  • fsync: cache writes locally, without updating remote servers, perhaps cache on client, so we don't even update the local server?
  • laminate: flush write index data across servers and put file into "no more writes" mode

"file size visible after fsync, writes visible after laminate"

  • fsync: local server at least needs to track the max extent so that file size allreduce across servers is accurate
  • laminate: flush write index data across servers and put file into "no more writes" mode

I don't think we need to support all modes from the start, but we should plan so that we can add them later when/if we get time. To do this, we'll have our internal functions as you describe (step 1 and step 2), and then we can have our wrappers map to those in different ways, which will be configurable by the user.
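As a rough sketch of that mapping in the wrappers, where the mode names, config fields, and helper calls are all placeholders rather than existing UnifyFS code:

/* Placeholder names; not existing UnifyFS code. */
enum sync_mode {
    WRITES_VISIBLE_AFTER_FSYNC,
    WRITES_VISIBLE_AFTER_LAMINATE,
    SIZE_VISIBLE_AFTER_FSYNC,   /* writes themselves visible only after laminate */
};

int wrapper_fsync(int gfid)
{
    switch (config.sync_mode) {
    case WRITES_VISIBLE_AFTER_FSYNC:
        sync_extents_to_local_server(gfid);            /* step 1 */
        return share_extents_across_servers(gfid);     /* step 2 */
    case SIZE_VISIBLE_AFTER_FSYNC:
        /* enough metadata for the file size allreduce, but no extent exchange */
        return sync_max_extent_to_local_server(gfid);
    case WRITES_VISIBLE_AFTER_LAMINATE:
    default:
        return 0;                                      /* keep everything buffered locally */
    }
}

int wrapper_laminate(int gfid)
{
    /* flush anything still buffered, regardless of mode */
    sync_extents_to_local_server(gfid);      /* step 1 */
    share_extents_across_servers(gfid);      /* step 2 */
    return set_read_only(gfid);              /* "no more writes" mode */
}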

For now, we are following the requirements that HDF has given us. We'll see what MPI I/O might throw on top of that.

adammoody avatar Jun 12 '20 21:06 adammoody

To add on a bit more.

It's possible for an app to fsync a file more than once before laminate, and each of those will update the file size or expose more writes. So we have to expect that a client may push data to the server multiple times for a given file.

Also, we currently have an implied fsync for a bunch of operations. If there are pending writes (a write but no fsync) and then the client calls things like close, stat, read, etc, we now fsync the file behind the scenes for them. We might want that implied fsync to be optional and configurable. If fsync is expensive and the user doesn't need it, then they can turn it off and rely on laminate. For instance, avoiding that implied fsync would help apps that might do something like:

for (1 million times)
  fd = open("foo")
  lseek(fd, somewhere)
  write(fd)
  close(fd)
laminate("foo")

adammoody avatar Jun 12 '20 21:06 adammoody

Thanks @adammoody for clarifying this. It makes sense to make those three modes (for now) configurable, and we can start to support them one by one.

So, we need to maintain a consistent file view across compute nodes at a certain level (e.g., file size or contents) even before lamination. Considering that exchanging extent information is costly, how about making the metadata server work in two different modes:

  • Before the lamination: dedicate one server to handle all metadata services for a file, including attributes and extent information, like we currently do in mdhim.
  • After the lamination: broadcast the file metadata to all servers.

For instance, before the lamination, all metadata for file A is going to be stored in a single server, e.g., server N, determined by hash(gfid(A)). When lamination happens, we broadcast the metadata from server N to all servers.

This will avoid some all-to-all communications during the write phase (before the lamination), especially if not all application processes are reading data back. And, after the lamination, we still do not need any communications for metadata queries.
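A rough sketch of that owner-server scheme; hash_gfid() and the RPC helpers are made-up names:

/* Sketch of the proposed two-phase metadata placement; names are made up. */
int metadata_owner(int gfid, int num_servers)
{
    return hash_gfid(gfid) % num_servers;   /* e.g., server N = hash(gfid(A)) */
}

/* Before lamination: route all metadata updates and queries for the file
 * to its single owner server. */
int metadata_update(int gfid, const extent_t *extents, size_t count)
{
    int owner = metadata_owner(gfid, num_servers);
    return rpc_add_extents(owner, gfid, extents, count);
}

/* On lamination: the owner broadcasts the file's metadata to all servers,
 * so post-lamination queries become purely local. */
int laminate_metadata(int gfid)
{
    int owner = metadata_owner(gfid, num_servers);
    return rpc_broadcast_metadata(owner, gfid);
}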

We can discuss further in the meeting.

sandrain avatar Jun 13 '20 19:06 sandrain

@adammoody I just want to make one thing clear:

Then we got our first customer in HDF. The HDF group said two things:

  1. They need stat() to return an accurate file size between open/close calls and before the file is laminated.
  2. They'd prefer to have writes be visible after fsync. That's needed to deal with reading back metadata they may have flushed to disk while still writing the file.

We were able to accomplish both of these by flushing index data to MDHIM on fsync, which made both the file size and writes visible after fsync.

So, what HDF wants is in fact for stat() to return a correct file size at any moment, but we plan to return a correct size from stat() only after fsync() has been called first?

sandrain avatar Jun 15 '20 18:06 sandrain

It wasn't quite as bad as having stat() return the accurate file size at any time. They need something like the following to work:

fd = open("foo")
write(fd)
close(fd)
MPI_Barrier(MPI_COMM_WORLD);
stat("foo");

They will ensure that all writes are synced across client processes before they call stat().

adammoody avatar Jun 15 '20 19:06 adammoody

So, in addition to the basic consistency model (lamination), we want to support write-visible-on-fsync and size-visible-on-fsync.

  • write-visible-on-fsync: data written by a process that invokes fsync() will be visible to other processes (including processes on remote compute nodes).

  • size-visible-on-fsync (or size-visible-before-lamination): the global file size is correctly reported by stat() before the file is laminated. We would need a system call that triggers UnifyFS to compute the global file size, e.g., close() or fsync().

I think write-visible-on-fsync is quite straightforward, since we only need to sync extent information from the calling client (instead of from all clients). size-visible-on-fsync may be a bit trickier because we would need all clients to report their file sizes.

sandrain avatar Jun 16 '20 13:06 sandrain

size-visible-on-fsync may be a bit trickier because we would need all clients to report their file sizes.

I think we could treat the file size in the same way for consistency -- a process needs to call fsync in order for any changes it made that affect the file size to become visible to other processes. It's as though none of the changes a process makes is guaranteed to reach the underlying file until there is an explicit or implied fsync.

// consider that all processes in MPI_COMM_WORLD do this
fd = open("foo")
lseek(fd, random_location)
write(fd)
close(fd) <-- assume an implied fsync here

// this stat call will return the size which at least accounts for the writes
// from the calling process, but there is no guarantee that it accounts
// for writes by any other processes
stat("foo");

MPI_Barrier(MPI_COMM_WORLD);

// however, this stat will report a file size that accounts for writes by *all* processes
// 1) all processes have an implied fsync in the close
// 2) in order for this process to pass the barrier,
//    we know all processes have returned from close (and their implied fsync)
stat("foo");

adammoody avatar Jun 16 '20 17:06 adammoody

Thanks, @adammoody. I think that's a good example that you gave :-)

sandrain avatar Jun 16 '20 22:06 sandrain

Just to keep this on our radar, since it may affect changes we're making before we get there...

For large scale, we'll need to allow for distributing the extent data across servers, rather than replicating the data on every server. The total number of extents in a file will often scale linearly with the number of client processes, O(P), so ramping up the process count brings a corresponding increase in the number of extents. A large enough job will overload our servers if each server holds all of the extent data.

For distributing extents, we can employ the same approach we use with MDHIM, where we slice the logical file offset range and assign different slices to be tracked by different servers. On a read, we then need to issue RPCs to the appropriate servers to look up the extent info.
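For reference, the slicing could look roughly like the MDHIM scheme; the slice size and mapping below are only illustrative:

/* Illustrative offset-range slicing; the constant and names are not real code. */
#define SLICE_SIZE (1024 * 1024)  /* bytes of logical file offset per slice */

/* Which server tracks extents covering this logical file offset? */
int server_for_offset(int gfid, off_t offset, int num_servers)
{
    off_t slice = offset / SLICE_SIZE;
    return (int)((hash_gfid(gfid) + slice) % num_servers);
}

/* A read covering [offset, offset + len) may touch several slices and thus
 * needs one extent-lookup RPC per server owning one of those slices. */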

Our current strategy of broadcasting all extent info to all servers will still be a very important optimization that can be used when the extent count is small enough. This optimization significantly speeds up reads, since all extent lookups are local operations.

And of course there is a middle ground between those two extremes where multiple servers, but not all, each track certain offset ranges. That allows for distributing lookups to common ranges across multiple servers.

adammoody avatar Jun 23 '20 20:06 adammoody

Random thought while reading this thread:

In addition to the things already discussed, should we also consider optimistically syncing seg_tree extents in the background? That way, when they do call sync() or laminate, there are fewer extents left to sync. We could make this feature toggleable.
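A minimal sketch of what that background flush might look like on the client; the threshold, flag, and helpers are all hypothetical:

/* Hypothetical background flusher; names and threshold are illustrative. */
static void *extent_flusher(void *arg)
{
    while (flusher_enabled) {            /* toggled by the proposed config option */
        if (seg_tree_count(&client_seg_tree) > FLUSH_THRESHOLD) {
            flush_extents_to_server(&client_seg_tree);  /* same path as sync() */
        }
        sleep(1);                        /* or wake on a condition variable */
    }
    return NULL;
}

/* started once at mount time: pthread_create(&tid, NULL, extent_flusher, NULL); */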

tonyhutter avatar Jun 26 '20 00:06 tonyhutter