glusterfs Optimize index xlator code path to improve performance for internal/external client

During running of smallfile for data creation I have observed we have a significant bottleneck at index xlator level, especially for XATTROP fop. As we know xattrop is an internal fop and used by AFR/EC to retain consistency across the cluster nodes. After doing so much testing we have found there is a scope of improvement at index xlator to handle gfid index correctly. After creating some dummy files in indices directories(dirty,xattrop,entry-changes) we found performance improvement later on after digging it more Xavi has found it is because of xfs behavior.

The XFS uses different on-disk formats to store the directories in the most efficient way. When a directory is small enough, the whole directory is stored inside the inode itself but when it grows beyond the available size in the inode it uses a new filesystem block to store the contents. As the size keeps growing XFS provides at least 2 additional formats, including a b+ tree-based structure for very big directories.

In general, the performance difference between these formats is very similar but switching between them is not a cheap operation. In gluster the index xlator constantly creates and removes entries from index directories for every xattrop operation.In the case of smallfile the performance was improved significantly as we did create some dummy files in the indices directory or use random gfid index to do link/unlink operation by index xlator to avoid swithcing between different formats.

We believe instead of a having directory indices (dirty,xattrop,entry-changes) to store the index, we can use a index file to keep the gfids to completely avoid the directory format switch cost and save the gfids in an internal hash.

We have decided the below approach should work

Do index insert/delete operation in an internal data structure(hash) and save the gfids in a file while index refcount is 1 and 0. During the brick start, the index xlator reads the index file and as it has found the first entry it inserts the key into the data structure and in the next entry delete the entry from the data structure so ideally, the indexfile behavior is similar to a journal. I have tried the approach only for dirty index, the performance is similar to earlier test case (dummy file or random gfids)

The other thing I observed during readdir shd fetch at max 517 entries(because of current gf_direent_size, per entry size is 260 bytes and max readdir buffer limit is 128kb so it fetch almost 517 entries in one attempt) and after checking the code i have found shd does not needs other gf_dirent data structure contents, IIUC to heal the content shd needs two things, gfid, and type of gfid(file/directory) so I believe during index_readdir we can save the key(gfid+type) in dictionary and in client4_readdir_cbk we can prepare a list of entries from the dictionary, I think we can handle 4000 entries in one shot instead of handling 517.

@pranithk @amarts @itisravi Before implement the same for others (xattrop/entry-changes) I want to know your view, do you see any issue if we go with this appraoch.

Jul 26 '22 11:07 mohit84

Didn't understand this part, could you give an example?

Jul 26 '22 14:07 pranithk

@mohit84 I did not understand the XFS performance penalty thing. Are you saying that XFS keeps switching between storing the dentries in the inode and storing them in a separate extents as and when the count decreases and increases dynamically (due to heal count fluctuations)? I would assume that this switching happens in the background in some kernel thread. Which FOP did the switching impact? Was it just index_readdir()? How much (approx) improvements did you observe in your simulated tests?

One thing I am doubtful of is what would be the real benefit of fetching multiple entries in one go. The number of readdir requests to xattrop folders will reduce but the main job of shd is to launch heal on all those entries. multi-threaded selfheal of even a few 10s of files load the bricks, so I wonder how we can benefit from fetching all 4000 entries in one go since we can't possibly launch heal on all of them in parallel.

Jul 26 '22 16:07 itisravi

We will create 3 index hash tables during xlator init, during index_add it will save key(gfid+type) in the table if it does not exist and at the same time the same key is appended in the backend file also. During index_del it will check if a key exists in the table if it exists, delete a key from the table and append the key in the backend file also. During index_del it will check if there is no key exists in the table it means all indexes are processed truncate the file size to 0. We will save two entries per index in the backend file one for insert and next for deletion so that in case of crash during init when xlator reads the file it will be able to make a decision what index was already healed before the crash a brick process. If two entry exists for any index it means the index was already processed successfully.

Jul 27 '22 00:07 mohit84

Yes We believe the same, the switching does not happen in the background. During index_add we do call link syscall and index_del we do call unlink so both are impacted. For the tests, I observed almost 10-20% performance improvement in smallfile case sometime the improvement is more than 50% also. The performance is improved while one of the brick is down

The initial level of data is available below link, https://github.com/gluster/glusterfs/issues/3606

Xavi executed a test program that just creates and removes entries 1 million times:

Running it with 3 pre-existing entries, it finished in 18.9 seconds Running it with 4 pre-existing entries, it finished in 162.4 seconds Running it with 5 pre-existing entries, it finished in 20.6 seconds

As we can see the performance affect based on the number of pre-existing entries for the same test case. For specific to fetch more dentries during readdir i am not sure about the improvement. It will save only network operations not sure how much it will contribute to performance. The other thing is we will save one lookup call that we are calling during index_readdir if we do save file type during index creation.

Jul 27 '22 00:07 mohit84

Below is the perf result after execute a small test case on a single node (1x3)

echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation cleanup --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation cleanup --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation cleanup --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top /mnt/test;echo 3 > /proc/sys/vm/drop_caches;/root/smallfile/smallfile_cli.py --operation cleanup --threads 8 --file-size 64 --files 10000 --top /mnt/test

1st attempt

total threads = 8 total files = 80000 total IOPS = 736 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 109.044 files/sec = 736.379147 IOPS = 736.379147 MiB/sec = 46.023697

2nd attempt total threads = 8 total files = 80000 total IOPS = 1045 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 77.089 files/sec = 1045.178611 IOPS = 1045.178611 MiB/sec = 65.323663

3rd attempt total threads = 8 total files = 80000 total IOPS = 1165 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 69.347 files/sec = 1165.495900 IOPS = 1165.495900 MiB/sec = 72.843494

4th attempt

total threads = 8 total files = 80000 total IOPS = 1272 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 63.297 files/sec = 1272.986833 IOPS = 1272.986833 MiB/sec = 79.561677

After kill 3rd brick

total threads = 8 total files = 80000 total IOPS = 894 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 90.191 files/sec = 894.136451 IOPS = 894.136451 MiB/sec = 55.883528

Then delete and cleanup the volume partition and create a new 1x3 volume

Created dummy index files on all the 3 bricks to avoid switching and run the test case again

for i in {1..10}; do uuid=uuidgen; touch /gluster{1..3}/node1/test1/.glusterfs/indices/entry-changes/myfile-$uuid; done for i in {1..10}; do uuid=uuidgen; touch /gluster{1..3}/node1/test1/.glusterfs/indices/dirty/myfile-$uuid; done for i in {1..10}; do uuid=uuidgen; touch /gluster{1..3}/node1/test1/.glusterfs/indices/xattrop/myfile-$uuid; done

1st attempt

total threads = 8 total files = 80000 total IOPS = 1478 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 54.116 files/sec = 1478.556423 IOPS = 1478.556423 MiB/sec = 92.409776

2nd attempt

total threads = 8 total files = 80000 total IOPS = 1727 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 46.546 files/sec = 1727.541317 IOPS = 1727.541317 MiB/sec = 107.971332

3rd attempt

total threads = 8 total files = 80000 total IOPS = 1756 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 46.213 files/sec = 1756.136125 IOPS = 1756.136125 MiB/sec = 109.758508

4th attempt total threads = 8 total files = 80000 total IOPS = 1743 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 46.361 files/sec = 1743.263299 IOPS = 1743.263299 MiB/sec = 108.953956

After kill 3rd brick total threads = 8 total files = 80000 total IOPS = 1764 total data = 4.883 GiB 100.00% of requested files processed, warning threshold is 70.00 elapsed time = 46.066 files/sec = 1764.399610 IOPS = 1764.399610 MiB/sec = 110.274976

As we can see after creating dummy index file and performance is also improved.

Jul 27 '22 02:07 mohit84

Won't this lead to OOM at some point because of the number of entries in the in-memory hash?

Jul 31 '22 23:07 pranithk

Both XFS and ZFS have peculiar behavior w.r.t. directories. On ZFS when the directory becomes big because of pending entries, once heal completes, directory size doesn't shrink to previous levels. Even though there are zero entries in that big directory readdir will still be slow. Now we find that XFS also has optimizations due to which index is performing poorly.

If we want to give uniform performance, the metadata needs to be maintained in some sort of b-tree by ourselves. If we don't have time to implement that solution, we should probably create some dummy files that index will ignore for XFS. Move to new directory automatically for ZFS after heal completes.

Jul 31 '22 23:07 pranithk

For xattrop/dirty index it consumes 25 bytes per entry(17bytes key + 8 bytes pointer) so for 256M it will save almost 10M entries. If we do consider 1G per indices directory it will be almost 40M entries which is sufficient. In case of entry-changes it will take consume more memory, it depends on the entry name if we do consider per entry is taking almost 256 bytes it will save 1M entries in 256M memory and 4M entries in 1G. The other way is we can use the current approach for entry-changes if you believe the limit can be reached easily. If we consider for dirty/xattrop still we will get good improvement.

Aug 01 '22 01:08 mohit84

As per my understanding, if we have brick multiplexing with 64 bricks per machine(Which is what I remember to be the case in the worst case scenario) this method proves to be a costly. Also in our deployment it is ~70-80M files per brick. Is using so much RAM necessary to solve this problem?

Aug 01 '22 04:08 pranithk

We have not considered the solution for brick_mux because i am still not sure the brick_mux feature is used by anyone except the container environment. I agree in the case of hash it increases memory usage but at the same time it will improve performance also. There are some other advantages also if we go with the hash index approach, implementing the readdir operation will be easy. We do need to read index per slot wise and we can use parent gfid also so that all file indexes will be saved in the same slot, in the case of bplustree we have to start from the root for every readdir attempt the solution will be complex. The other thing I am not sure what would be the performance impact in both cases. In the case of hashing, we are following sequential file writing and the file is deleted only while table_size is 0 and file size has reached the threshold limit, In the case of reading no disk operation. In the case of bplustree it will be block-based write/read in the case of read also we have to do disk read operation, not sure about the performance difference. I will try to run a test case to generate 100M entries for bplustree (need to find out some library)/disk based hash index and see the number. We can make it feature configurable and a user can take the decision to use hash based disk index vs the current approach. Personally, I free nowadays memory is not a big concern for users, even for 100M it will not consume more than 2.6G.

Aug 01 '22 05:08 mohit84

There is one important thing to consider here:

The data needs to be present on disk to be able to recover from crashes in all cases without risking losing the marks that tell self-heal that a file has been modified.

This is true for the existing solution or for any solution we can implement.

The first implication of this is that we need to force data to be on disk before allowing any operation on the related inode. Otherwise an inode could be modified on disk but the "mark" to know it could be lost in a crash. This implies the utilization of O_SYNC or fsync.

The current implementation doesn't use it, so it's not 100% safe.

Using fsync can be a serious performance issue, but any new solution we consider should include it.

Using a b-tree is a good theoretical approach, but updating it in a safe way is not easy, specially if we want to do it efficiently to avoid performance issues. It can be done (databases do that), but trying to implement the safety guarantees of a database ourselves I think it's out of scope.

We could also use an existing database, of course, but this seems overkill to me, and gives us little control about how and when things are updated (specially the use of fsync).

Using a sequential file access to track updates is very fast. The fsync cost is still present, but it shouldn't hurt so much since we avoid many seeks that would increase the latency of fsyncs.

As you mentioned, the drawback of this approach is that the data stored on disk is not structured and needs to be tracked in memory, which is an important factor to consider.

In this particular case the information we need to keep is very small: just the gfid, type and a pointer to insert it in the hash table. We can round it to 32 bytes per entry. We only need to track modified entries, not the entire volume, so in the worst case we could consider that 1 million files is modified each day.

I think having brick-mux is irrelevant here. If we have brick-mux enabled, all the memory will be consumed by a single process. Without brick-mux, each process will consume a portion of the memory, but the total consumption should be roughly the same. So we just consider how many bricks are present in a server, independently if they are using brick-mux or not. Let's assume we have 64 bricks per server.

This means that we would need 64 * 1,000,000 * 32 ~ 2 GiB of memory to store all the problematic files modified in one day. That's a lot, sure, but a machine able to run 64 bricks should have much more memory than that, like 64 or even 128 GiB at the minimum. So 2 GiB is not a critical value IMO. Even one week of accumulated data would require 14 GiB (that's a worst case because probably not all files modified each day will be different from the other days). It's a lot more memory but still not a critical value (btw, a failed disk or brick should be repaired much before a week).

I don't see the memory utilization as a critical factor here. It's true that the memory requirements during brick failures can be significant, but they shouldn't cause serious issues in the server (like OOM kills), at most they will reduce the amount of cached data, so it could cause a drop in performance, but modifying a big directory (like we are doing now) or updating a database won't be cheap operations either.

On the other side, the sequential access pattern is probably one of the easiest to implement and the one with less performance loss. And since we can implement it as we like, we could even execute the fsync in the background, avoiding extra latency in the I/O path (for example by running the fsync in the background, but synchronize any future update on the inode until the fsync has finished. In many cases the future fops will arrive after the fsync has completed, so performance shouldn't suffer). It could also be the first step towards a journal-based solution that would solve many other problems.

That said, it's true that there are some potential corner cases where the brick could be down for extended periods of time, or the number of modified files per day could be huge. We should also consider them. For those cases I think we could implement a second structure when the hash table grows too much (similar to what XFS does with the directory structure, but with a much higher threshold). This second structure could be a b-tree if we want to.

In this case, the most common cases would be very fast, but when the memory utilization grows beyond a threshold, we can switch to an hybrid disk/memory structure to save memory. It will be slower, but it will keep memory utilization under control in all cases. We can also provide high and low thresholds to avoid continuous structure migrations if we are running just around the threshold, like it happens with XFS.

An alternative could be to use a "disk-friendly" cache structure so that after reaching a given memory consumption (potentially configurable), part of the structure could be simply flushed to disk, without actually needing to implement two different structures for both cases.

Sep 12 '22 08:09 xhernandez

I am working on bplustree, as the patch will be available i will upload it.

Sep 23 '22 11:09 mohit84

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

May 21 '23 16:05 stale[bot]

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

Jun 11 '23 09:06 stale[bot]

glusterfs glusterfs copied to clipboard

Optimize index xlator code path to improve performance for internal/external client

glusterfs
glusterfs copied to clipboard