
Optimizing file access by caching fh3 structs

Open santagada opened this issue 2 years ago • 7 comments

Hi,

I've been trying to optimize NFS access by caching fh3 results for folders to speed up lookups. As far as I can see, there is currently no place in the libnfs code to add a cache for those, is there? The way I would implement it, it would also need to be thread safe and shared across all threads (I have one NFS context/connection/socket per thread). Is anything like that planned? Has anyone ever thought of adding it?

santagada avatar Oct 11 '21 13:10 santagada

It is hard to do this type of caching safely for NFSv3, since there are no leases or cache coherency protocol. What some clients do is cache things for, say, up to a second by default, and hope for the best. But it is unsafe.
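(For illustration, that "cache briefly and hope for the best" approach usually boils down to a timestamped entry that is only trusted within a timeout. A minimal sketch; the struct and helper names are hypothetical, not libnfs API:

#include <time.h>

/* Hypothetical time-bounded fh cache entry: trust it only for CACHE_TTL
 * seconds after it was filled in. Unsafe by design: another client can
 * change the namespace within the window without us noticing. */
#define CACHE_TTL 1.0

struct ttl_fh_entry {
        time_t        cached_at;   /* when the handle was cached */
        unsigned char fh[64];      /* fh3 is an opaque blob of up to 64 bytes */
        unsigned      fh_len;
};

static int ttl_entry_is_fresh(const struct ttl_fh_entry *e)
{
        return difftime(time(NULL), e->cached_at) < CACHE_TTL;
}

On a fresh hit you skip the LOOKUP entirely; on an expired entry you fall back to a full lookup and refresh the timestamp.)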

I don't have any immediate plans for this, but it might be worthwhile as an optional feature for some workloads.

On the topic of multithreading, please see the pthread branch. It is a work-in-progress implementation to make libnfs multithread safe so that you can create a single context and use it across many/all of your threads. Don't use it on production data just yet, but please try it out if you are interested.

sahlberg avatar Oct 11 '21 13:10 sahlberg

Oh, I thought the fh3 structures were opaque but guaranteed stable (with the caveat that they refer to whatever was at that path in the past).

We are on Windows, so if this is pthread-specific it won't work (and we want to move to production data, so I think we'll first go with 4.0.0).

Any chance of being able to plug in our own memory manager? I see that a lot of the functions call malloc directly, and it would be nice to route those allocations through our own allocator (we are using jemalloc).

santagada avatar Oct 11 '21 14:10 santagada

They are mostly stable. But think of them as "large, versioned inode numbers". They are stable until they are dropped server-side and become -ESTALE. There are many reasons why they can become -ESTALE, including something as simple as memory pressure on the server, in which case you need to remember the original file name and open flags so that you can transparently reopen the file and get the new/updated filehandle. (libnfs does not do this today, but if we add fh caching we should do it properly.) Caching them is very difficult to do correctly in NFSv3. In NFSv4 it would be easier, by using leases.
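(As a rough illustration of that transparent-reopen idea: the client keeps the original path and open flags next to the cached handle so it can retry on -ESTALE. A minimal sketch; open_by_handle() and open_by_path() are hypothetical placeholders, not libnfs functions:

#include <errno.h>

/* Supplied elsewhere by the client; declared here so the sketch is
 * self-contained. These are hypothetical, not libnfs API. */
int open_by_handle(void *fh, int flags);
int open_by_path(const char *path, int flags, void **fh_out);

struct cached_file {
        const char *path;    /* remembered so we can reopen on -ESTALE */
        int         flags;   /* remembered open flags */
        void       *fh;      /* cached fh3 blob */
};

static int open_with_estale_retry(struct cached_file *cf)
{
        int ret = open_by_handle(cf->fh, cf->flags);
        if (ret == -ESTALE) {
                /* The server dropped the handle; fall back to a fresh
                 * lookup by path and refresh the cached handle. */
                ret = open_by_path(cf->path, cf->flags, &cf->fh);
        }
        return ret;
}

)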

If we think about them as versioned inodes, these are some of the issues that can happen when you cache them:

...
f = open("/very/long/path/to/file", O_RDWR|O_CREAT)        (1)   // you get a filehandle for this
close(f)
rename("/very/long/path/to/file", "/very/different/path/to/file")
        // possibly done by a different client, so we might have no way of knowing
        // will not change the filehandle for "file"
f = open("/very/long/path/to/file", O_RDWR)                (*)

In this case, if we just cache the filehandles, even if we limit the cache to 1 second, at the point of (*) the cache will let us successfully open a nonexistent path and hand back a filehandle for a different path.

What clients have to do here is basically cache the filehandle at (1), but then at (*) validate the handle during the open(), so the open becomes:

NFS3 GETATTR("/")                          // validate the handle for "/"
NFS3 GETATTR("/very")
NFS3 GETATTR("/very/long")
NFS3 GETATTR("/very/long/path")
NFS3 GETATTR("/very/long/path/to")
NFS3 GETATTR("/very/long/path/to/file")

I.e. it still needs to go and validate that all the filehandles from root to target are valid. In the rename case above, the GETATTR on "/very/long/path/to" would show a changed mtime, which is how the client would know that its cached entry for "file" can no longer be trusted, and that the open() must be redone from scratch and fail with an error.

Still, even doing these kinds of validations with GETATTR is a performance boost compared to no caching at all, since in the case where the client's cache is completely empty, handling the open() takes almost twice as many roundtrips:

NFS3 GETATTR("/")                          // validate the handle for "/"
NFS3 LOOKUP("/very")
NFS3 GETATTR("/very")
NFS3 LOOKUP("/very/long")
NFS3 GETATTR("/very/long")
...

We could do something similar in libnfs, and it would be as safe as what most kernel clients do, but it would require caching both the fh3 and the directory entry name, to let us skip the NFS3 LOOKUP when we already have a cached handle for "long" in the directory "/very". I.e. we cache not full paths but components, and each directory has its own cache of its contents. If we have a cached value for "component" in a directory, we can skip the LOOKUP half of the LOOKUP(component) + GETATTR(component) pair.
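(A minimal sketch of what such a per-directory component cache could look like; all type and function names here are hypothetical stand-ins, not libnfs API, and fh3/mtime are reduced to simple structs for illustration:

#include <string.h>

/* Stand-ins for the NFSv3 XDR types: fh3 is an opaque blob of up to
 * 64 bytes, and we keep only the mtime we validate against. */
struct fh3    { unsigned len; unsigned char data[64]; };
struct mtime3 { unsigned seconds; unsigned nseconds; };

/* One cached directory entry, keyed by component name. */
struct dentry_cache_entry {
        char                      *name;   /* component name, e.g. "long" */
        struct fh3                 fh;     /* cached filehandle */
        struct mtime3              mtime;  /* last-seen mtime */
        struct dentry_cache_entry *next;   /* chained hash bucket */
};

/* Each directory carries its own cache of its contents. */
struct dir_cache {
        struct dentry_cache_entry *buckets[256];
};

/* Return the cached entry for `name` in `dir`, or NULL on a miss.
 * On a hit the caller can skip the NFS3 LOOKUP and go straight to
 * GETATTR-based validation. */
static struct dentry_cache_entry *
dir_cache_lookup(struct dir_cache *dir, const char *name)
{
        unsigned h = 5381;
        for (const char *p = name; *p; p++)
                h = h * 33 + (unsigned char)*p;    /* djb2 string hash */
        for (struct dentry_cache_entry *e = dir->buckets[h & 255];
             e; e = e->next)
                if (strcmp(e->name, name) == 0)    /* compare full name */
                        return e;
        return NULL;
}

)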

sahlberg avatar Oct 11 '21 21:10 sahlberg

Thinking about it more. This is what you can do.

First, path lookups, as in open(), are very expensive in NFS because the protocol does not handle paths. So anything that takes a path decomposes into a sequence of LOOKUPs, one component at a time, until you reach the target. Or sometimes a combination of LOOKUP+GETATTR.

(LOOKUP may return the attributes of the object, making the GETATTR redundant, but there is no guarantee; let's assume it does return the attributes for now.)

This means that looking up a path takes one roundtrip per path component, or two roundtrips per component if you have to do an explicit GETATTR to fetch the attributes. Similarly, if a client needs to validate cached handles/attributes for a path, it usually issues a sequence of GETATTRs, one per cached component, and checks that the mtime has not changed. If the mtime has changed, you have to discard the caches and re-lookup the target and all intermediate directories.

What you could do safely in libnfs, since we have a powerful async interface available, is reduce this to just a single roundtrip regardless of how deep the path is. To do this, you need to:

1. For all directories, add a cache of all discovered direntries (discovered by LOOKUP or READDIRPLUS), storing the filehandle as well as the attributes, keyed by the component name.

2. For all path-based commands, like nfs3_open_async(), normalize the path and check if you have cached entries for every single component in the path. If you do, then you can:

2.1. For each component, send an async LOOKUP of the component in its parent handle.

2.2. For each reply received, validate that the LOOKUP was successful, i.e. that the name exists, the filehandle matches the cached value, and the mtime is unchanged.

2.3. Once all replies are received, if all the tests in 2.2 were successful, you can use the cached handle.

The trick here is that using the async interface you can run 2.1 - 2.3 completely in parallel for all the components in the path. This approach lets you safely validate and use a cached filehandle after spending just one roundtrip to validate the entire path. It would be a pretty decent amount of work to implement, but it would work, and it would speed up handling of deep paths.
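(A minimal sketch of steps 2.1 - 2.3, assuming a component cache like the one sketched earlier; send_lookup_async() is a hypothetical placeholder for the raw async LOOKUP machinery, not a libnfs function:

/* Shared state for validating one whole cached path in parallel. */
struct path_validation {
        int outstanding;   /* replies still in flight */
        int ok;            /* stays 1 only if every component checks out */
};

/* Hypothetical: issue LOOKUP(name) in parent_fh and invoke cb(arg, matched)
 * on reply, where `matched` says whether the returned fh and mtime still
 * equal the cached values. Not libnfs API. */
void send_lookup_async(const void *parent_fh, const char *name,
                       void (*cb)(void *arg, int matched), void *arg);

static void validate_component_cb(void *arg, int matched)
{
        struct path_validation *v = arg;
        if (!matched)
                v->ok = 0;        /* cache is stale; must re-lookup */
        v->outstanding--;
        /* Once outstanding hits 0 with ok still 1, the caller may use the
         * cached target filehandle with no further roundtrips. */
}

static void validate_cached_path(const void **parent_fhs,
                                 const char **names, int ncomponents,
                                 struct path_validation *v)
{
        v->outstanding = ncomponents;
        v->ok = 1;
        /* All LOOKUPs go out back to back: one roundtrip total,
         * not one per component. */
        for (int i = 0; i < ncomponents; i++)
                send_lookup_async(parent_fhs[i], names[i],
                                  validate_component_cb, v);
}

)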

sahlberg avatar Oct 12 '21 00:10 sahlberg

Or, in 2.1 and 2.2, it might be even better to just use GETATTR instead of LOOKUP and simply check the mtime. The trick is that if you have all of the components in the cache already, you can do these checks concurrently, in a single roundtrip.
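(The per-reply test in that variant is trivial; a sketch, assuming an fattr3-style mtime in the GETATTR reply, with hypothetical names:

/* Stand-in for the NFSv3 time type carried in the attributes. */
struct nfstime { unsigned seconds; unsigned nseconds; };

/* A cached entry is still valid only if the object's mtime is
 * unchanged since we cached it. */
static int mtime_unchanged(const struct nfstime *cached,
                           const struct nfstime *fresh)
{
        return cached->seconds  == fresh->seconds &&
               cached->nseconds == fresh->nseconds;
}

)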

sahlberg avatar Oct 12 '21 00:10 sahlberg

It all makes sense; it's just that the amount of data you need to cache is huge, as these fh3 handles are up to 64 bytes long (and in libnfs there's an int for the size as well).

In our particular use case folders never move around, so I'll write a simple cached_open function that caches the mapping from the last folder to its files, and that will be enough for us. I'm also storing only a murmurhash of the path instead of the actual string, to make it a uniformly sized hashmap.

Knowing about ESTALE will help a lot; thanks for sharing all of this (and thanks again for the lib, implementing NFS from scratch would have been very painful).

santagada avatar Oct 13 '21 09:10 santagada

Sounds good. That use case is too specialized to be a good fit for the library, but the application should be able to handle it outside of the library. Just one thing: I would avoid using a hash as a replacement for the actual path. One day, sooner or later, you may win the lottery and hit a hash collision, and everything will become sad.
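(For what it's worth, a sketch of the collision-safe variant: use a hash only to pick the bucket, but always compare the stored full path byte-for-byte, so a collision degrades to a cache miss instead of returning the wrong filehandle. All names here are hypothetical application code, not libnfs API:

#include <string.h>

/* Hypothetical application-side cache: full normalized path -> fh3 blob. */
struct path_fh_entry {
        char *path;                   /* owned copy of the full path */
        unsigned char fh[64];         /* fh3 is at most 64 bytes */
        unsigned fh_len;
        struct path_fh_entry *next;   /* chained hash bucket */
};

static struct path_fh_entry *buckets[1024];

static unsigned path_hash(const char *p)
{
        unsigned h = 5381;            /* djb2 string hash */
        while (*p)
                h = h * 33 + (unsigned char)*p++;
        return h & 1023;
}

static struct path_fh_entry *path_cache_get(const char *path)
{
        for (struct path_fh_entry *e = buckets[path_hash(path)];
             e; e = e->next)
                if (strcmp(e->path, path) == 0)   /* full comparison, not hash-only */
                        return e;
        return NULL;
}

)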

sahlberg avatar Oct 13 '21 09:10 sahlberg