Improve directory access

Open xhernandez opened this issue 4 years ago • 33 comments

Accessing a directory can be slow when DHT has many subvolumes because all of them need to be scanned sequentially.

One way to improve that is to create a special hidden file inside each directory that contains all the entries of that directory along with their gfids. This file needs to be kept in sync with the real directory contents, but it provides a huge advantage: when a readdir(p) is requested, instead of iterating over all the subvolumes, the special file is present on only one of the subvolumes and can be read as fast as any other file.

It also solves the consistency issue when one subvolume is down: either it will show all the contents of the directory, even those that are present on the missing subvolume, or it won't show any entry at all (it will return an error) when the special file is on the missing subvolume. This way we may return some errors when subvolumes are down (which is expected), but if data is returned, it will always be consistent.

xhernandez avatar Feb 25 '21 09:02 xhernandez

@xhernandez great idea. With AFR and EC based on a DHT volume, this will make all directory operations faster.

perrynzhou avatar Feb 26 '21 01:02 perrynzhou

Adding a few more details here for discussion and reference. For people who were closely working with RIO (#243) / MDSDS (#816), the topic may sound similar, but this approach would keep the graph the same, instead of separating the data and metadata graphs.

The first goal of this is to make sure all directory information stays in one place, instead of being spread across multiple subvolumes; that way readdir can be done from just one subvolume of distribute instead of being sent to all subvolumes like today! Just by doing that, we get many side effects which work in a positive way for Gluster; let's discuss them in detail later. This design aligns with Gluster's approach of keeping data and metadata together (in this case, the data of the directory).

First, only at the distribute layer (or as a new xlator above dht, like shard), we need to separate the entry create calls into dentry create and file create operations. So, all dentry create operations are done on the subvolume holding the parent. Based on the success of this operation, we can do the data create on the particular subvolume where it hashes. If we start hashing this based on the gfid used in dentry create, we can solve the issues of renames, hardlinks, sticky bits and link-to files in Distribute :-)

For the benefit of performance at the disk level, we can keep this information in a file with rows containing just index, name, type and gfid as the fields (or, for better performance, maybe Berkeley DB?). Like .glusterfs/, this can be a file in each directory, so everyone knows where to look for it; say .gf-entry-index or whatever name suits the need.
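A rough sketch of what one row of such a file could look like (the struct and field names here are purely illustrative, not a settled on-disk format):

#include <stdint.h>

#define GF_ENTRY_NAME_MAX 255

/* One record of the hypothetical .gf-entry-index file: index, name, type, gfid. */
struct gf_entry_index_record {
    uint64_t index;                        /* ordering / slot number */
    uint8_t  gfid[16];                     /* gfid of the entry */
    uint8_t  type;                         /* DT_REG, DT_DIR, ... */
    uint8_t  name_len;                     /* length of the name that follows */
    char     name[GF_ENTRY_NAME_MAX + 1];  /* entry name, NUL-terminated */
};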

So, if we look at what needs to change, there would be a change in lookup: a 'name' lookup happens on the parent and basename (where the basename is searched in the file at the posix layer), and all gfid lookups (i.e., everything after the name lookup succeeds) will happen on the subvolume where the gfid hashes.

The challenge is mainly in implementing the file format efficiently, so that deletes don't cause random access issues in the .gf-entry-index file, and we also need to provide a fast way to search for a given name in the file!

If we provide a mechanism to build the .gf-entry-index file when it is not available, then this feature can be backward compatible too.

@xhernandez anything you want to add more on it?

@gluster/gluster-all please consider having a look at what changes in your translator due to this change, and how it can be done without disturbing the current model a lot.

amarts avatar Mar 03 '21 03:03 amarts

So essentially this metadata file is a database, that needs ACID, fast appends, indexing, journaling(?), compaction, etc. I assume we are going to use an existing implementation?

mykaul avatar Mar 03 '21 07:03 mykaul

So essentially this metadata file is a database, that needs ACID, fast appends, indexing, journaling(?), compaction, etc.

Yes, it is 'structured' data, because the format is well defined and it has to support the above properties.

I assume we are going to use an existing implementation?

Using an existing one is ideal. I don't think anyone can afford the time to write another database implementation! But as we haven't started any work on this, I'm open to suggestions :-)

amarts avatar Mar 09 '21 10:03 amarts

I'm not sure if we really need all the database features. I fear that adding a database-like approach could be overkill (to start with, how will we use a database library from the client side without a filesystem? Will we need to hook all filesystem accesses done by the library to access its files? How will internal lock management work when it's used by other clients?)

The data we need to keep is very simple. For readdir(p) operations it will be read sequentially. Lookups won't access it at all. Additions can be easily implemented by appending. The only "complex" operations are deletes, but I don't think this justifies a full db implementation.

Just to optimize disk requests we could use a b-tree implementation for big directories (> 4000 entries), and a simple sequential file for smaller ones.

xhernandez avatar Mar 09 '21 12:03 xhernandez

@xhernandez As per my recent understanding of crawling a disk filesystem, this feature will improve the performance of readdirp only when a directory has < 128KB of data per brick that can be returned to the user:

  • Directories with lots of directories
  • Directories with very few files.

So IMO the main reason to implement this should be consistency when a node goes down.

pranithk avatar Mar 09 '21 12:03 pranithk

@xhernandez As per my recent understanding of crawling on a disk filesystem, this feature will improve performance of readdirp only when a directory has < 128KB of data per brick which can be responded to the user.

Why ?

We can easily fit 4000 entries (it depends on the length of file names, though) in a single 128 KiB disk block. This means that we can serve them to the user with a single backend request, even if it takes more than one request to send them through fuse.
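(As a rough check, assuming each entry carries a 16-byte gfid plus a small fixed header: 128 KiB / 4000 entries ≈ 32 bytes per entry, which leaves room for names of roughly 10-15 characters, so the estimate only holds for fairly short names.)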

  • Directories with lots of directories
  • Directories with very few files.

So IMO the main reason to implement this should be consistency when a node goes down.

That's not the main reason to me. What we need to be sure of is that we don't return bad/stale data, but if the data gets corrupted or out of date because of a crash, we simply fall back to the filesystem crawl to regenerate it, which will still exist and needs to be maintained anyway.

There are many other benefits using this approach:

  • It never returns partial data in case of a down subvolume (especially on pure distributed volumes).
  • It doesn't require an additional disk access to get the gfid of each file (it's included in the directory data inside the file).
  • Performance doesn't degrade as the distribution factor increases.
  • It avoids many small requests that add a lot of latency.

I see this change as a performance improvement. Consistency is not so important as long as we are able to detect the inconsistency and regenerate the file.

You can think about this as a cache, not as a source of truth.

xhernandez avatar Mar 09 '21 13:03 xhernandez

So essentially this metadata file is a database, that needs ACID, fast appends, indexing, journaling(?), compaction, etc.

Yes, it is 'structured' data. Because the format is well defined, and it has to support above properties.

I assume we are going to use an existing implementation?

Using existing is ideal. I don't think someone can afford time for writing another database implementation! But as we haven't started any work on this, open to suggestions :-)

Ceph is using https://rocksdb.org/ - it could be overkill for Gluster's purposes.

@xhernandez - without a DB, I'm not sure how you can recover from partial updates, rollbacks, etc. Even simple file formats need maintenance - from compaction to atomic updates, etc.

mykaul avatar Mar 09 '21 13:03 mykaul

@xhernandez As per my recent understanding of crawling on a disk filesystem, this feature will improve performance of readdirp only when a directory has < 128KB of data per brick which can be responded to the user.

Why ?

We can easily fit 4000 entries (it depends on the length of file names, though) in a single 128 KiB disk block. This means that we can serve them to the user with a single backend request, even if it takes more that one request to send them through fuse.

Let's assume that 128KB can be used to serve X files.

What I meant by my previous comment is that this enhancement will show better performance for readdir when the directory has less than dht-subvol-count * X files.

readdirp I think is a separate beast, because we may have to implement a bulk lookup on entries falling on different bricks to reduce the number of network operations needed to get the size of files and other xattrs that may be necessary.

  • Directories with lots of directories
  • Directories with very few files.

So IMO the main reason to implement this should be consistency when a node goes down.

That's not the main reason to me. What we need to be sure is that we don't return bad/stale data, but if the data gets corrupted or out of date because of a crash, we simply fail back to the filesystem crawl to regenerate it, which will still exist and needs to be maintained anyway.

There are many other benefits using this approach:

  • It never returns partial data in case of a down subvolume (specially on pure distributed volumes).

This is what I meant by consistency. I should have used a better term; consistency is an overloaded term :-(.

  • It doesn't require an additional disk access to get the gfid of each file (it's included in the directory data inside the file).

This is useful only for readdir, not readdirp, because readdirp also does a stat and a getxattr for some other xattrs.

  • Performance doesn't degrade as the distribution factor increases.
  • It avoids many small requests that add a lot of latency

This depends on whether the request is readdir or readdirp, for the reasons I mentioned before.

I see this change as a performance improvement. Consistency is not so important as long as we are able to detect the inconsistency and regenerate the file.

You can think about this as a cache, not as a source of truth.

pranithk avatar Mar 09 '21 13:03 pranithk

So essentially this metadata file is a database, that needs ACID, fast appends, indexing, journaling(?), compaction, etc.

Yes, it is 'structured' data. Because the format is well defined, and it has to support above properties.

I assume we are going to use an existing implementation?

Using existing is ideal. I don't think someone can afford time for writing another database implementation! But as we haven't started any work on this, open to suggestions :-)

Ceph is using https://rocksdb.org/ - could be an overkill for Gluster's purposes.

@xhernandez - without a DB, I'm not sure how you can recover from partial updates, rollback, etc. Even simple files format need maintenance - from compaction, to atomic updates, etc.

The main point here is that this file won't replace the existing filesystem structure (this would require a more drastic change). All files and directories will still be present in the backend filesystem.

This introduces a big problem:

  • It's really complex to ensure full consistency of both views (filesystem structure and file contents) at all times.

So why implement a complex thing if we can simply treat it like a cache? In that case the implementation is simpler:

  1. Is the cache up to date with the filesystem?
     1.1 Yes: Use the file to get the directory contents.
     1.2 No: Use the filesystem to get the directory contents and update the cache.
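A minimal sketch of that decision, assuming the per-directory file is called .gf-entry-index and that a simple mtime comparison is an acceptable freshness test (the real check would more likely be a generation counter or an xattr):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Is the hypothetical .gf-entry-index file at least as new as its directory? */
static bool index_is_fresh(const char *dirpath)
{
    char idx[4096];
    struct stat dir_st, idx_st;

    snprintf(idx, sizeof(idx), "%s/.gf-entry-index", dirpath);
    if (stat(dirpath, &dir_st) != 0 || stat(idx, &idx_st) != 0)
        return false;                 /* missing or unreadable index => not fresh */
    return idx_st.st_mtime >= dir_st.st_mtime;
}

/* Caller side of the decision:
 *   if (index_is_fresh(dirpath))
 *       serve the listing from the index file (one sequential read, one subvolume);
 *   else
 *       crawl the backend as today and rebuild the index for the next readdir.
 */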

Current implementation already uses locks, so atomic updates are not a problem.

xhernandez avatar Mar 09 '21 13:03 xhernandez

@xhernandez As per my recent understanding of crawling on a disk filesystem, this feature will improve performance of readdirp only when a directory has < 128KB of data per brick which can be responded to the user.

Why ? We can easily fit 4000 entries (it depends on the length of file names, though) in a single 128 KiB disk block. This means that we can serve them to the user with a single backend request, even if it takes more that one request to send them through fuse.

Let's assume that 128KB can be used to serve X files.

What I meant by my previous comment is that this enhancement will show better performance for readdir when the directory has less than dht-subvol-count * X files.

Not really. Even if there are more files, they can be easily read and served from a single subvolume, instead of having to check all subvolumes as it's done now. To read a directory now, you need to open it on all subvolumes (with all the locks and extra overhead needed by other xlators), and you cannot read from one subvolume until you have completed the previous one. With a file this is much simpler and requires fewer operations.

It not only provides a performance improvement, it also provides better consistency when bricks are down, and offsets are more accurate, without corner cases. It's an improvement in many aspects.

readdirp I think is a separate beast because we may have to implement bulk lookup on entries falling on different bricks to reduce the number of network operations to get size of files and other xattrs that maybe necessary.

A pure 'ls' doesn't really need it, but if it's really needed, at worst we will have the same number of disk requests (network operations are not an important issue when they can be done in parallel instead of sequentially).

Note that all these requests can be sent in parallel in the background while the list of files has already been returned to the user and it's processing them. This can help hide some of the latency.
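A small user-space illustration of that pattern (plain POSIX code, not xlator code, with hypothetical names): readdir hands the names to the caller immediately, and a few worker threads perform the per-entry stats in parallel afterwards.

#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define NWORKERS 8

struct batch {
    char **names;
    size_t count;
    size_t next;                  /* index of the next name to stat */
    pthread_mutex_t lock;
    const char *dirpath;
};

/* Worker: pick the next unclaimed name and stat it; stands in for the per-entry lookup. */
static void *stat_worker(void *arg)
{
    struct batch *b = arg;
    char path[4096];
    struct stat st;

    for (;;) {
        pthread_mutex_lock(&b->lock);
        size_t i = b->next++;
        pthread_mutex_unlock(&b->lock);
        if (i >= b->count)
            return NULL;
        snprintf(path, sizeof(path), "%s/%s", b->dirpath, b->names[i]);
        (void)lstat(path, &st);
    }
}

int main(int argc, char **argv)
{
    const char *dirpath = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(dirpath);
    if (!d)
        return 1;

    struct batch b = { .dirpath = dirpath };
    pthread_mutex_init(&b.lock, NULL);

    /* Phase 1: plain readdir, names go back to the caller right away. */
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        b.names = realloc(b.names, (b.count + 1) * sizeof(*b.names));
        b.names[b.count++] = strdup(e->d_name);
        printf("%s\n", e->d_name);
    }
    closedir(d);

    /* Phase 2: the per-entry lookups run in parallel instead of sequentially. */
    pthread_t workers[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, stat_worker, &b);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}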

  • Directories with lots of directories
  • Directories with very few files.

So IMO the main reason to implement this should be consistency when a node goes down.

That's not the main reason to me. What we need to be sure is that we don't return bad/stale data, but if the data gets corrupted or out of date because of a crash, we simply fail back to the filesystem crawl to regenerate it, which will still exist and needs to be maintained anyway.

There are many other benefits using this approach:

  • It never returns partial data in case of a down subvolume (specially on pure distributed volumes).

This is what I meant by consistency. Should have used better term. Consistency is overloaded term :-(.

  • It doesn't require an additional disk access to get the gfid of each file (it's included in the directory data inside the file).

This is useful only for readdir, not readdirp. Because readdirp also does stat and getxattr for some other xattrs.

Yes. That's true, but I think we really don't need so many readdirp. Even if they are needed, a combination of readdir + parallel lookups in the background will probably be much better than sequential readdirp.

  • Performance doesn't degrade as the distribution factor increases.
  • It avoids many small requests that add a lot of latency

this depends on whether the request is readdir or readdirp for the reasons I mentioned before.

readdirp is really a problem, even now. Why do we really need it? Does it have any advantage over readdir + parallel lookups?

I see this change as a performance improvement. Consistency is not so important as long as we are able to detect the inconsistency and regenerate the file. You can think about this as a cache, not as a source of truth.

xhernandez avatar Mar 09 '21 14:03 xhernandez

@xhernandez As per my recent understanding of crawling on a disk filesystem, this feature will improve performance of readdirp only when a directory has < 128KB of data per brick which can be responded to the user.

Why ? We can easily fit 4000 entries (it depends on the length of file names, though) in a single 128 KiB disk block. This means that we can serve them to the user with a single backend request, even if it takes more that one request to send them through fuse.

Let's assume that 128KB can be used to serve X files. What I meant by my previous comment is that this enhancement will show better performance for readdir when the directory has less than dht-subvol-count * X files.

Not really. Even if there are more files, they can be easily read and served from a single subvolume, instead of having to check all subvolumes as it's done now. To read a directory now, you need to open it on all subvolumes (with all the locks and extra overhead needed by other xlators), and

The requests are sent in parallel. So it is not a contributing factor to latency.

cannot read from one subvolume until you have completed the previous one. With a file this is much simpler and requires less operations.

It requires the same number of operations if the data is more than the number I specified earlier. The reason I was stressing this is that I had a different model in my mind until crawl + stat gave very bad performance in practice on our cluster.

  • It made me feel that readdirp is bad and should be done only when we know it gives very good performance.
  • This feature, while good, has limitations in the sense that for directories with lots of files it may not give a lot of benefit.

It not only provides a performance improvement, it also provides better consistency when bricks are down, and offsets are more accurate, without corner cases. It's an improvement in many aspects.

readdirp I think is a separate beast because we may have to implement bulk lookup on entries falling on different bricks to reduce the number of network operations to get size of files and other xattrs that maybe necessary.

A pure 'ls' doesn't really need it, but if it's really needed, at worst we will have the same amount of disk requests (network operations are not an important issue when they can be done in parallel instead of sequential).

Note that all these requests can be sent in parallel in the background while the list of files has been returned to the user and it's processing them. This can contribute hiding some of the latency.

  • Directories with lots of directories
  • Directories with very few files.

So IMO the main reason to implement this should be consistency when a node goes down.

That's not the main reason to me. What we need to be sure is that we don't return bad/stale data, but if the data gets corrupted or out of date because of a crash, we simply fail back to the filesystem crawl to regenerate it, which will still exist and needs to be maintained anyway.

There are many other benefits using this approach:

  • It never returns partial data in case of a down subvolume (specially on pure distributed volumes).

This is what I meant by consistency. Should have used better term. Consistency is overloaded term :-(.

  • It doesn't require an additional disk access to get the gfid of each file (it's included in the directory data inside the file).

This is useful only for readdir, not readdirp. Because readdirp also does stat and getxattr for some other xattrs.

Yes. That's true, but I think we really don't need so many readdirp. Even if they are needed, a combination of readdir + parallel lookups in the background will probably be much better than sequential readdirp.

  • Performance doesn't degrade as the distribution factor increases.
  • It avoids many small requests that add a lot of latency

this depends on whether the request is readdir or readdirp for the reasons I mentioned before.

readdirp is really a problem, even now. Why we do need it really ? Does it have any advantage over a readdir + parallel lookups ?

I never understood this completely either. @amarts Do you remember how much improvement readdirp brought?

I see this change as a performance improvement. Consistency is not so important as long as we are able to detect the inconsistency and regenerate the file. You can think about this as a cache, not as a source of truth.

pranithk avatar Mar 09 '21 14:03 pranithk

readdirp is really a problem, even now. Why we do need it really ? Does it have any advantage over a readdir + parallel lookups ?

Even I never understood this completely. @amarts Do you remember how much improvement readdirp brought?

When we did readdirp(), it was mainly tied to md-cache performance. I don't remember having strong data on the performance numbers from the readdirp migration!

amarts avatar Mar 09 '21 15:03 amarts

@pranithk @amarts @mohit84 is it decided to use a database to improve the performance of directory operations? Is there any document for this design?

perrynzhou avatar Mar 12 '21 14:03 perrynzhou

Hi,

For the specific case of readdir(p) improvement, my observation is that getdents is fast in a usual environment but readdirp is slow. I believe readdirp is slow because it executes statfs for every dentry, which takes time while the stat is not available in the cache. I have executed a test case to analyse the behavior.

I configured a 3x1 environment on a physical node that has sufficient resources to run the test case, and used smallfile to create 16M files of 64k each, so that a single directory ends up holding a total of 1M files across the 3 nodes:

echo 3 > /proc/sys/vm/drop_caches; time ./smallfile_cli.py --operation create --threads 16 --file-size 64 --files 1000000 --top /mnt/test1 --files-per-dir 1000000

To capture correct data without cache interference, I unmounted the volume, stopped/started it and cleaned up the cache every time before executing a test case.

1) Run with default options

time ll | wc -l
1000000

real 3m55.167s user 0m7.021s sys 0m10.586s

2) Run after enabling the metadata-cache group

time ll | wc -l
1000000

real 3m44.559s user 0m6.257s sys 0m8.087s

3) Run after enabling readdir-ahead

time ll | wc -l
1000000

real 2m24.593s user 0m5.533s sys 0m6.301s

4) Run after disabling readdirp completely (disable performance.force-readdirp and dht.force-readdirp, and mount the volume with -o use-readdirp=no)

time ll | wc -l
1000000

real 89m28.062s user 0m13.193s sys 0m35.359s

5) Run only ls

time ls | wc -l
999999

real 0m41.120s user 0m3.087s sys 0m0.645s

As we can see in the above output, readdir is fast (taking only 41s) but readdir with lookup is far too slow (taking 89m) because fuse runs the lookup operations sequentially. By using some internal xlator we can get some improvement, but I don't believe it would ever match readdirp. We can also see that readdirp performance is itself slow due to the stat/xattr fetch calls run by posix_readdirp. I think there are two reasons why the lookup operation is slow when it is run by fuse:

  1. It is sequential.
  2. Internal xlators set some xattrs, like quick-read (GF_CONTENT_KEY) and dht_lookup, that take time to fetch at the posix layer.

I believe even if we go with the approach of saving dentries in a hidden file we would not get much improvement in the (ll) operation, because readdir is not the slow part, and it will also hurt the performance of every entry operation because an internal xlator has to separate the entry calls into two subcalls.

I have used another approach to improve the readdirp performance:

  1. Spawn a background thread at the time of starting a brick process; the job of this thread is to run fstat on a directory based on the pfd ctx that has been added to a list.
  2. At the time of calling opendir, the pfd is added to the list and a signal is sent to the thread.
  3. The thread then runs fstatat in the background so that it warms the kernel cache, and sets a flag on the pfd once it has finished the job.
  4. At the time of receiving a readdirp request (posix_readdirp), wait on the status of the flag (is_finish); once the flag is updated, the thread has finished its job, so fetch the dentries and update the stats. Since dht runs the operation sequentially, while readdirp is running on the 1st subvol the other subvols keep warming the kernel cache through the thread, so the readdirp operation eventually becomes fast on the other two subvols. (A sketch of this handshake follows the results below.)

I have found the results below.

  5. After applying the patch with default options

time ll | wc -l
1000000

real 2m18.746s user 0m6.041s sys 0m6.942s

  6. Enable readdir-ahead with the patch

time ll | wc -l
1000000

real 1m18.237s user 0m5.570s sys 0m6.680s

As we can see, there is almost a 100% improvement in the readdir-ahead case and more than a 50% improvement with the default options. I also have a plan to use a cache: we can try a posix cache that saves the dentries along with the stat data; the cache would be available only for the duration between opendir and closedir. I don't know whether it will improve things or not; I will update if I find any improvement.
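A minimal sketch of the opendir/readdirp handshake described in steps 2-4 (the names used here, such as is_finish and warm_dir_stats, are illustrative and not taken from the actual patch):

#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

struct pfd_warm {
    pthread_mutex_t lock;
    pthread_cond_t done;
    bool is_finish;        /* set once the warming pass has completed */
    int dirfd;             /* directory being read */
    char **names;          /* entry names collected at opendir time */
    size_t count;
};

/* Background thread body: pre-stat every entry so the kernel dentry/inode
 * caches are already warm by the time posix_readdirp asks for per-entry stats. */
static void *warm_dir_stats(void *arg)
{
    struct pfd_warm *w = arg;
    struct stat st;

    for (size_t i = 0; i < w->count; i++)
        (void)fstatat(w->dirfd, w->names[i], &st, AT_SYMLINK_NOFOLLOW);

    pthread_mutex_lock(&w->lock);
    w->is_finish = true;
    pthread_cond_broadcast(&w->done);
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Called from the readdirp path: block until the warming pass is done. */
static void wait_for_warmup(struct pfd_warm *w)
{
    pthread_mutex_lock(&w->lock);
    while (!w->is_finish)
        pthread_cond_wait(&w->done, &w->lock);
    pthread_mutex_unlock(&w->lock);
}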

Please share your view on the same.

mohit84 avatar Apr 04 '21 13:04 mohit84

Hi Mohit,

I had done some work on this a long time ago. I remember getting very good performance improvements with readdirp with the following change:

https://review.gluster.org/#/c/glusterfs/+/22366/

I also found that iothreads on the brick was a bottleneck since it was low down in the stack. Moving it above quota and afr in the brick stack helped as well, but those changes were WIP when I left Gluster and I no longer have them.

Perhaps you can try it out and see.

Regards, Nithya

nbalacha avatar Apr 05 '21 05:04 nbalacha

@mohit84 I think readdir-ahead has bugs where it doesn't consider all the xattrs that the parent xlators are querying for, which leads to issues (especially with shard). We should probably fix these issues if readdir-ahead is giving good results.

The only issue with warming the cache is that it only works while the cache is warm. If we have a big file read/write, the metadata cache may not be warm.

In general the readdirp request size is 128KB. If on average we assume that the length of a dentry name is, say, 16 bytes, then we have 8192 entries for which we have to do stat+getxattrs. One thread doing all of them one at a time will lead to higher latency. Why not use multiple threads to do the stat+getxattrs?

pranithk avatar Apr 05 '21 07:04 pranithk

Hi Mohit,

I had done some work on this a long time ago. I remember getting very good performance improvements with readdirp with the following change:

https://review.gluster.org/#/c/glusterfs/+/22366/

I also found that iothreads on the brick was a bottleneck since it was low down in the stack. Moving it above quota and afr in the brick stack helped as well, but those changes were WIP when I left Gluster and I no longer have them.

Perhaps you can try it out and see.

Regards, Nithya

Thanks Nithya, I will try it and update.

mohit84 avatar Apr 05 '21 09:04 mohit84

@mohit84 I think readdir-ahead has bugs where it doesn't consider all xattrs that the parent xlators are querying for which leads to issues (especially with shard). We probably should fix these issues in case readdir-ahead is giving good results.

The only issue with warming the cache is, it only works when the cache is warm. If we have a big file-read/file-write, the metadata cache may not be warm.

In general readdirp request size is 128KB. If on average if we assume that the length of the dentry name is say 16bytes, then we have 8192 entries for which we have to do stat+getxattrs. 1 thread doing all of them one at a time will lead to higher latency. Why not use multiple threads to do this stat+getxattrs?

Yes, I agree we have a bug in readdir-ahead; that is a day-one bug and we need to fix it. The current bug is that readdir-ahead does not invalidate dentry stats if any entry operation happens in that directory; otherwise there is no other bug. I think the bug would not impact testing readdir(p) performance. As for xattrs, it forwards the xattrs as they come from the upper layer. In the case of fuse the readdirp buffer size is 4k, so by default it fetches almost 20 entries per operation, because each gf_dirent_t object is almost 208 bytes (the size can increase further based on the dentry name). After enabling readdir-ahead the buffer size increases to 128k, so posix_readdirp fetches almost 600 entries in one attempt. Even when I tried to increase it to almost 2000 entries in a single attempt there was no improvement, because the slowness comes from calling stat+getxattr for every dentry. I will try to fetch the same with multiple threads and update if I find more improvement.

mohit84 avatar Apr 05 '21 09:04 mohit84

  1. Run after disable readdirp completely    Disable performance.force-readdirp and dht.force-readdip and mount a volume    after pass -ouse-readdirp=no    time ll | wc -l 1000000

real 89m28.062s user 0m13.193s sys 0m35.359s

This must be a bug and should be fixed. Even if we assume that all lookups are done sequentially, it takes 5.4 ms per entry. Basically this means the equivalent of a disk access for each lookup. If using readdirp doesn't require a disk access for each entry, then we have something wrong somewhere.

  1. Run only ls     time ls | wc -l 999999

real 0m41.120s user 0m3.087s sys 0m0.645s

41 seconds seems too much to me for a simple directory read without any lookup. This directory could weigh some tens of megabytes, which means we are reading it at less than 1 MiB/s. Gluster can read files at hundreds of MiB/s, so this can be improved.

As we can see in the above output, readdir is fast (taking only 41s) but readdir with lookup is far too slow (taking 89m) because fuse runs the lookup operations sequentially.

Does kernel fuse serialize the lookups, or is it our own fuse xlator?

  1. Internal xlator sets some xattr like quick-read(GF_CONTENT_KEY)/dht_lookup that will take time   to fetch xattrs at posix layer

IMO quick-read should be disabled by default.

I believe even if we go with the approach of saving dentries in a hidden file we would not get much improvement in the (ll) operation, because readdir is not the slow part, and it will also hurt the performance of every entry operation because an internal xlator has to separate the entry calls into two subcalls.

Without understanding where the latencies come from, I would say it's too speculative to assume that the change won't provide any benefit (just to start with, Gluster could read the entire directory in less than a second if it were a file, instead of 41 seconds. The lookup issue needs to be fixed though).

I have used one another approach to improve the readdirp performance.

  1. Spawn a background thread at the time of start a brick process, the    job of this thread is to run fstat on a directory on the basis of pfd    ctx has been added to the list.

As Pranith said, a single worker thread normally becomes a bottleneck sooner or later, and creating a thread pool just for this seems overkill to me.

  1. At the time of calling opendir pfd has been added to the list and    send a signal to the thread
  2. The thread has been started to run fstatat in background so that    it will update kernel cache and set a flag on pfd as it has finished    a job

What if those stats are not really needed, the directory is big and the data is not cached? It could trigger a lot of disk I/O that would waste IOPS needed by other operations in progress.

xhernandez avatar Apr 06 '21 09:04 xhernandez

  1. Run after disable readdirp completely    Disable performance.force-readdirp and dht.force-readdip and mount a volume    after pass -ouse-readdirp=no    time ll | wc -l 1000000

real 89m28.062s user 0m13.193s sys 0m35.359s

This must be a bug and should be fixed. Even if we assume that all lookups are done sequentially, it takes 5.4 ms per entry. Basically this means the equivalent of a disk access for each lookup. If using readdirp doesn't require a disk access for each entry, then we have something wrong somewhere.

  1. Run only ls     time ls | wc -l 999999

real 0m41.120s user 0m3.087s sys 0m0.645s

41 seconds seems too much to me for a simple directory read without any lookup. This directory could weigh some tens of megabytes, which means we are reading it at less than 1 MiB/s. Gluster can read files at hundreds of MiB/s, so this can be improved.

As we can see in the above output, readdir is fast (taking only 41s) but readdir with lookup is far too slow (taking 89m) because fuse runs the lookup operations sequentially.

Kernel fuse does serialize lookups ? or is our own fuse xlator ?

  1. Internal xlator sets some xattr like quick-read(GF_CONTENT_KEY)/dht_lookup that will take time   to fetch xattrs at posix layer

IMO quick-read should be disabled by default.

I believe even if we go with the approach of saving dentries in a hidden file we would not get much improvement in the (ll) operation, because readdir is not the slow part, and it will also hurt the performance of every entry operation because an internal xlator has to separate the entry calls into two subcalls.

Without understanding where the latencies come from, I would say it's too speculative to assume that the change won't provide any benefit (just to start with, Gluster could read the entire directory in less than a second if it were a file, instead of 41 seconds. The lookup issue needs to be fixed though).

I have used one another approach to improve the readdirp performance.

  1. Spawn a background thread at the time of start a brick process, the    job of this thread is to run fstat on a directory on the basis of pfd    ctx has been added to the list.

As Pranith said, creating single worker threads normally becomes a bottleneck sooner or later. Creating a thread pool just for this seems overkill to me.

  1. At the time of calling opendir pfd has been added to the list and    send a signal to the thread
  2. The thread has been started to run fstatat in background so that    it will update kernel cache and set a flag on pfd as it has finished    a job

What if those stats are not really needed, the directory is big and the data is not cached ? it could trigger a lot of disk I/O that will waste IOPS that could be necessary for other operations in progress.

Hi,

Yes, there is a bug in quick_read: even when quick_read is disabled it still tries to read GF_CONTENT_KEY from the backend, and afaik that takes time. The other difference is that in the case of readdirp the number of xattrs is 6, while in the case of a default lookup it is 11. I will try to fetch new data after applying the quick_read patch.

(gdb) p *dict
$7 = {max_count = 6, hash_size = 1, count = 6, ...}

Walking dict->members_list, the 6 xattr keys requested during readdirp are:

  system.posix_acl_default
  system.posix_acl_access
  security.capability
  security.ima
  user.swift.metadata
  trusted.glusterfs.dht.linkto

(gdb) b posix_lookup
(gdb) p *xdata
$3 = {max_count = 11, hash_size = 1, count = 11, ...}

Walking xdata->members_list, the 11 xattr keys requested during a default lookup are:

  get-link-count
  security.capability
  security.ima
  user.swift.metadata
  glusterfs.content
  trusted.glusterfs.dht.linkto
  glusterfs.open-fd-count
  trusted.glusterfs.dht
  trusted.glusterfs.dht.mds
  system.posix_acl_access
  system.posix_acl_default

mohit84 avatar Apr 06 '21 10:04 mohit84

@mohit84 - why is 'user.swift.metadata' enabled?

mykaul avatar Apr 06 '21 11:04 mykaul

@mohit84 - why is 'user.swift.metadata' enabled?

The xattr is populated by md-cache because cache-swift-metadata is true in downstream; we need to disable it in downstream also.

{
    .key = {"cache-swift-metadata"},
    .type = GF_OPTION_TYPE_BOOL,
    .default_value = "true",
    .op_version = {GD_OP_VERSION_3_7_10},
    .flags = OPT_FLAG_SETTABLE | OPT_FLAG_CLIENT_OPT | OPT_FLAG_DOC,
    .description = "Cache swift metadata (user.swift.metadata xattr)",
},

mohit84 avatar Apr 06 '21 12:04 mohit84

  1. Run after disable readdirp completely    Disable performance.force-readdirp and dht.force-readdip and mount a volume    after pass -ouse-readdirp=no    time ll | wc -l 1000000

real 89m28.062s user 0m13.193s sys 0m35.359s

This must be a bug and should be fixed. Even if we assume that all lookups are done sequentially, it takes 5.4 ms per entry. Basically this means the equivalent of a disk access for each lookup. If using readdirp doesn't require a disk access for each entry, then we have something wrong somewhere.

  1. Run only ls     time ls | wc -l 999999

real 0m41.120s user 0m3.087s sys 0m0.645s

41 seconds seems too much to me for a simple directory read without any lookup. This directory could weigh some tens of megabytes, which means we are reading it at less than 1 MiB/s. Gluster can read files at hundreds of MiB/s, so this can be improved.

As we can see in the above output, readdir is fast (taking only 41s) but readdir with lookup is far too slow (taking 89m) because fuse runs the lookup operations sequentially.

Kernel fuse does serialize lookups ? or is our own fuse xlator ?

  1. Internal xlator sets some xattr like quick-read(GF_CONTENT_KEY)/dht_lookup that will take time   to fetch xattrs at posix layer

IMO quick-read should be disabled by default.

I believe even if we go with the approach of saving dentries in a hidden file we would not get much improvement in the (ll) operation, because readdir is not the slow part, and it will also hurt the performance of every entry operation because an internal xlator has to separate the entry calls into two subcalls.

Without understanding where the latencies come from, I would say it's too speculative to assume that the change won't provide any benefit (just to start with, Gluster could read the entire directory in less than a second if it were a file, instead of 41 seconds. The lookup issue needs to be fixed though).

I have used one another approach to improve the readdirp performance.

  1. Spawn a background thread at the time of start a brick process, the    job of this thread is to run fstat on a directory on the basis of pfd    ctx has been added to the list.

As Pranith said, creating single worker threads normally becomes a bottleneck sooner or later. Creating a thread pool just for this seems overkill to me.

  1. At the time of calling opendir pfd has been added to the list and    send a signal to the thread
  2. The thread has been started to run fstatat in background so that    it will update kernel cache and set a flag on pfd as it has finished    a job

What if those stats are not really needed, the directory is big and the data is not cached ? it could trigger a lot of disk I/O that will waste IOPS that could be necessary for other operations in progress.

After disabling quick-read we get almost a 30 percent improvement, but it still does not match readdirp. I believe posix_lookup is taking more time because it has to build a path based on the gfid, while in the case of posix_readdirp the path is built from the dentry name and the directory handle path. It executes the statfs call several times for real_path and the parent, so I believe it takes more time, while posix_readdirp calls sys_lstat only once. I need to debug more why we are getting this huge difference between readdirp and readdir+lookup.

time ll | wc -l
1000000

real 66m40.648s user 0m11.266s sys 0m25.767s

By default quick_read is enabled; that's why the xattr (GF_CONTENT_KEY) is set during the lookup operation. I think either we need to disable the quick_read xlator by default or we should set the default value of max-file-size to 0.

.key = "performance.quick-read", .voltype = "performance/quick-read", .option = "!perf", .value = "on", .op_version = 1, .description = "enable/disable quick-read translator in the volume.", .flags = VOLOPT_FLAG_CLIENT_OPT | VOLOPT_FLAG_XLATOR_OPT},

mohit84 avatar Apr 06 '21 15:04 mohit84

Hi,

After disabling quick_read and changing the approach to building the path in posix_lookup, I am able to reduce the lookup latency significantly. As we can see, after applying the code below in posix_lookup, the lookup operations for all the files (1M) finish in around 17m, which is significantly lower than before (89m). I think if we send the lookups in parallel from some internal client xlator, then maybe we can achieve numbers similar to what we are getting with readdirp.

time ll | wc -l
1000000

real 17m0.074s user 0m8.199s sys 0m17.360s

if (!gf_uuid_is_null(loc->pargfid)) {
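        /* Named lookup: build the entry's path from the parent's handle path
           plus the basename, instead of resolving the entry's own gfid handle
           through .glusterfs. */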
       len = posix_handle_path(this, loc->pargfid, NULL, NULL, 0);
        hpath = alloca(len + 256); /* NAME_MAX */
        ppath = alloca(len);
        ret = posix_handle_path(this, loc->pargfid, NULL, hpath, len);
        if (ret > 0) {
            strcpy(ppath, hpath); 
            len = strlen(hpath);
            hpath[len] = '/';
            strcpy(&hpath[len + 1], loc->name);
            ret = posix_pstat(this, loc->inode, loc->inode->gfid, hpath, &buf, _gf_false);
            if (ret == 0) {
                posix_update_iatt_buf(&buf, -1, hpath, xdata);
                if (xdata) {
                    xattr = posix_xattr_fill(this, hpath, loc, NULL, -1, xdata, &buf);     
                }
            } else {
                gf_log("MY_LOG", GF_LOG_INFO, "posix_pstat is failed for hpath %s", hpath);
            }
            ret = posix_pstat(this, NULL, loc->pargfid, ppath, &postparent, _gf_false);
            if (ret == 0) {
                posix_update_iatt_buf(&postparent, -1, ppath, xdata);
            } else {
               gf_log("MY_LOG", GF_LOG_INFO, "posix_pstat is failed for parent path %s", ppath);
            }
            op_ret = 0;
            goto unwind;
        }
   }

Below is the latency for the first brick:

test1-server.latency.LOOKUP=697499.718,333052,232303676158.000
test1-index.latency.LOOKUP=682463.196,333052,227295732358.000
test1-marker.latency.LOOKUP=680682.331,333052,226702611659.000
test1-io-threads.latency.LOOKUP=678250.241,333052,225892599430.000
test1-upcall.latency.LOOKUP=669342.403,333052,222925826046.000
test1-locks.latency.LOOKUP=668476.988,333052,222637597893.000
test1-access-control.latency.LOOKUP=663046.185,333052,220828857942.000
test1-bitrot-stub.latency.LOOKUP=658932.556,333052,219458805564.000
test1-changetimerecorder.latency.LOOKUP=657647.045,333052,219030663681.000
test1-posix.latency.LOOKUP=656590.249,333052,218678695712.000

Thanks, Mohit Agrawal

mohit84 avatar Apr 07 '21 06:04 mohit84

This line:

posix_update_iatt_buf(&postparent, -1, ppath, xdata);

is just because of the CloudSync feature, right? I find it very inefficient that we take this small hit for every lookup on every file when the feature is rarely used. And if we don't need it, do we need this one:

ret = posix_pstat(this, NULL, loc->pargfid, ppath, &postparent, _gf_false);

?

mykaul avatar Apr 07 '21 08:04 mykaul

This line:

posix_update_iatt_buf(&postparent, -1, ppath, xdata);

Is just because of CloudSync feature, right? I find it very inefficient we take this small hit for every lookup on every file, when this feature is rarely used. And if we don't need it, do we need this one:

ret = posix_pstat(this, NULL, loc->pargfid, ppath, &postparent, _gf_false);

?

Yes, the stat of the parent is required; dht/afr are the main consumers of it. The function posix_update_iatt_buf is specific to the cloudsync feature only.

mohit84 avatar Apr 07 '21 09:04 mohit84

I have found the reason why readdirp and readdir+lookup don't match. At the time of doing readdirp, the lookup path is built on the basis of the entry found in the directory and it always traverses a single directory, but when we run a lookup operation it first populates a real_path and then validates whether the real_path has a link with .glusterfs or not. To validate that, it has to look up in .glusterfs in a different directory based on the gfid, so the number of lookup operations increases and we get the performance reduction when we call readdir + posix_lookup. I have observed that posix_handle_hard is the main culprit. After just skipping the verify_handle part in posix_gfid_set when size_of_gfid == 16, the lookup operations for all the files (1M) finish in 9m. I think we can achieve something similar to readdirp if we run the lookups in parallel through some internal xlator in the case of readdir.

time ll | wc -l
1000000

real 9m35.101s user 0m8.805s sys 0m21.734s

I think the question is: is gfid validation required during the lookup operation? For every entry operation we call posix_gfid_set at the end, so it is not possible that the link with .glusterfs does not exist. The other option is that we run the gfid validation part in the background at the time of lookup.

mohit84 avatar Apr 07 '21 13:04 mohit84

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Dec 26 '21 14:12 stale[bot]

Worth revisiting?

mykaul avatar Dec 26 '21 15:12 mykaul