
dht: hash-mod vs hash-jump-consistent results

Open pranithk opened this issue 4 years ago • 20 comments

Just want to share the results of some experiments I did: I added the attached patch, and this is what I observed for the amount of data that is migrated with each addition of a subvolume, up to 10 subvolumes.

For the test, I created 10000 files of 1KB each in the root directory:

Data migrated

Hash-mod based migration:
Migrated 4929 / 10000 files [49.29%] - 1 -> 2
Migrated 3228 / 10000 files [32.28%] - 2 -> 3
Migrated 3324 / 10000 files [33.24%] - 3 -> 4
Migrated 3473 / 10000 files [34.73%] - 4 -> 5
Migrated 2927 / 10000 files [29.27%] - 5 -> 6
Migrated 2818 / 10000 files [28.18%] - 6 -> 7
Migrated 3181 / 10000 files [31.81%] - 7 -> 8
Migrated 3880 / 10000 files [38.80%] - 8 -> 9
Migrated 3819 / 10000 files [38.19%] - 9 -> 10

Jump consistent hash based migration:
Migrated 4947 / 10000 files [49.47%] - 1 -> 2
Migrated 3304 / 10000 files [33.04%] - 2 -> 3
Migrated 2528 / 10000 files [25.28%] - 3 -> 4
Migrated 1992 / 10000 files [19.92%] - 4 -> 5
Migrated 1621 / 10000 files [16.21%] - 5 -> 6
Migrated 1448 / 10000 files [14.48%] - 6 -> 7
Migrated 1211 / 10000 files [12.11%] - 7 -> 8
Migrated 1112 / 10000 files [11.12%] - 8 -> 9
Migrated 1022 / 10000 files [10.22%] - 9 -> 10

@xhernandez @amarts @BarakSason @mohit84 @tshacked @csabahenk

What do you guys think? Shall we take this in along with the global layout work? Attaching just enough of a diff (~30 lines) to get this experiment going.

jump-consistent.txt
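For anyone who hasn't seen it, this is the jump consistent hash function from the Lamping & Veach paper, written here as a minimal standalone C sketch (the function name is mine; the attached patch presumably wires something equivalent into dht's subvolume selection, fed with the hash of the file name):

```c
#include <stdint.h>

/* Jump consistent hash (Lamping & Veach, 2014): maps a 64-bit key to one of
 * num_buckets buckets. When a bucket is added, only ~1/num_buckets of the
 * keys change bucket, which is what the migration numbers above reflect. */
static int32_t
jump_consistent_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1, j = 0;

    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;   /* 64-bit LCG step */
        j = (int64_t)((b + 1) *
                      ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t)b;
}
```

One consequence is that subvolumes need a stable ordering, since buckets are plain indexes; that is also why removing anything but the last subvolume is awkward (see the comments below).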

EDIT: Added number of dht subvolume changes per migration

pranithk avatar Oct 24 '20 13:10 pranithk

Want to clarify a bit here. Jump consistent hash works really well for volumes that are expanding. For shrinking, one should always remove-brick from the last subvolume for better performance.

pranithk avatar Oct 24 '20 13:10 pranithk

While it is definitely better than the existing layout management in DHT, we should still try to come up with a solution that avoids the full filesystem scan.

pranithk avatar Oct 24 '20 15:10 pranithk

It really seems a very interesting way to distribute data.

> While it is definitely better than the existing layout management in DHT, we should still try to come up with a solution that avoids the full filesystem scan.

The function used for Jump Consistent Hash is reversible. This means that if we have N nodes and we are given the last b (which basically maps to the node number that contains the file), we can compute the range of values of key that would generate N in the next iteration (meaning that the corresponding files would be migrated to the new node).

Basically:

         (b + 1) * 2^64
key >= ------------------ - 2^33
              N + 1

         (b + 1) * 2^64
key <  ------------------ - 2^33
                N

To take advantage of this we would need to keep an index based on key. Since each node knows its own b, each node could compute and process the list of files to be moved independently.
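For illustration, a rough sketch of that computation (a hypothetical helper, not existing gluster code). It assumes "key" means the internal 64-bit state after the LCG step that selected b, and it ignores the exact rounding behaviour of the double arithmetic in the hash function, so the boundaries would need validation:

```c
#include <stdint.h>

/* Given a file currently placed on subvolume b out of n_old subvolumes,
 * return the inclusive range [first, last] of key states for which the
 * next jump lands exactly on the new subvolume n_old, i.e. the files
 * that have to migrate when going from n_old to n_old + 1 subvolumes.
 * Uses the GCC/Clang __int128 extension to stay exact. */
static void
jump_migration_key_range(int32_t b, int32_t n_old,
                         uint64_t *first, uint64_t *last)
{
    /* next jump == n_old  <=>
     * n_old <= (b + 1) * 2^31 / ((key >> 33) + 1) < n_old + 1 */
    unsigned __int128 num = (unsigned __int128)(b + 1) << 31;

    uint64_t x_min = (uint64_t)(num / (uint32_t)(n_old + 1));
    uint64_t x_max = (uint64_t)(num / (uint32_t)n_old) - 1;

    *first = x_min << 33;
    *last = ((x_max + 1) << 33) - 1;  /* wraps to UINT64_MAX for b == n_old - 1 */
    /* note: the range can be empty (*first > *last) for small b and large n_old */
}
```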

I'm not sure if maintaining that index could have a big impact on performance for creates and unlinks. An alternative is to scan the brick locally (much faster than scanning through the volume) and decide which files need to be moved. It doesn't avoid the scan, but it's much faster, and it doesn't need to keep an index either.

If the hash came from gfid instead of name, I think it would be possible to reuse existing .glusterfs structure to implicitly keep the index (though with some changes in naming). I need to think a bit more about this.

I've seen there are other consistent hashing methods that could have interesting properties, like Maglev hashing. I'll try to spend some more time with these methods.

xhernandez avatar Oct 24 '20 22:10 xhernandez

> Want to clarify a bit here. Jump consistent hash works really well for volumes that are expanding. For shrinking, one should always remove-brick from the last subvolume for better performance.

In the case of arbitrary node removal, the performance will be similar to the existing hash, right?

mohit84 avatar Oct 25 '20 03:10 mohit84

> Want to clarify a bit here. Jump consistent hash works really well for volumes that are expanding. For shrinking, one should always remove-brick from the last subvolume for better performance.
>
> In case of arbitrary node removal the performance will be similar to existing hash, right.

This is not something that has been extensively tested. We will have to do our own testing to find out.

pranithk avatar Oct 25 '20 07:10 pranithk

> It really seems a very interesting way to distribute data.
>
> While it is definitely better than the existing layout management in DHT, we should still try to come up with a solution that avoids the full filesystem scan.
>
> The function used for Jump Consistent Hash is reversible. This means that if we have N nodes and we are given the last b (which basically maps to the node number that contains the file), we can compute the range of values of key that would generate N in the next iteration (meaning that the corresponding files would be migrated to the new node).
>
> Basically:
>
>          (b + 1) * 2^64
> key >= ------------------ - 2^33
>               N + 1
>
>          (b + 1) * 2^64
> key <  ------------------ - 2^33
>                 N
>
> To take advantage of this we would need to keep an index based on key. Since each node knows its own b, each node could compute and process the list of files to be moved independently.
>
> I'm not sure if this could have a big impact on performance for creates and unlinks. An alternative is to scan the brick locally (much faster than scanning through the volume) and decide which files need to be moved. It doesn't prevent the scan, though it's much faster, but it doesn't need to keep an index either.
>
> If the hash came from gfid instead of name, I think it would be possible to reuse existing .glusterfs structure to implicitly keep the index (though with some changes in naming). I need to think a bit more about this.
>
> I've seen there are other consistent hashing methods that could have interesting properties, like Maglev hashing. I'll try to spend some more time with these methods.

Cool, will be eagerly waiting for it. If you want me to research some methods, do testing, etc., update here with what you want me to experiment with and I can get you the data.

pranithk avatar Oct 25 '20 07:10 pranithk

@pranithk First of all, these numbers look very good, and everything that reduces the number of files being migrated is welcome. In the last couple of days I've been testing things related to the layout as well, and I got somewhat different numbers. I performed a similar test to yours (expanding a 2-node volume to 3, 10,000 files on root) and the result was:

Edit: attached as a doc because of formatting: rebalance_results.txt

So I'm actually seeing many more files being migrated than you... any explanation?

As some of the previous comments mentioned GL, I'd like to make some clarifications regarding GL and the enhancements I'm working on in rebalance:

1 - As we've discussed, we have 2 problems with rebalance:
1.1 - Inner migration of files between existing nodes.
1.2 - Inefficiencies in both FS traversal and file migration.
Both of these problems originate from the constraints of the per-dir layout model (the need to perform a fix-layout operation on all the dirs).

2 - In order to solve 1.1, the bucket model needs to be designed & implemented, so that when nodes are added (or removed) buckets will only move to (or from) these nodes. In order to solve 1.2, I've already begun work on https://github.com/gluster/glusterfs/issues/1685 and https://github.com/gluster/glusterfs/issues/1684, and next up in line is https://github.com/gluster/glusterfs/issues/1660. @xhernandez regarding your comment on scanning the brick locally instead of through the volume, I'm planning to perform this as part of https://github.com/gluster/glusterfs/issues/1660 (there was another issue I opened recently - https://github.com/gluster/glusterfs/issues/1618 - but I closed it, as https://github.com/gluster/glusterfs/issues/1660 contains this effort and is broader in its scope). In order to fully complete this effort, substantial work is needed on the client side as well (implementing GL on the client side); I haven't created issues for these yet.

3 - This is the important one - this work will NOT prevent the FS traversal (it will greatly optimize the process, but an FS scan will still be required). If a design that can eliminate the FS scan can be suggested, that is of course the optimal solution (my thinking was that once we progress with GL and the bucket model, maybe we could come up with some caching mechanism to keep track of which files belong to each bucket, thus eliminating the need to scan the FS, but at this point I'm not able to offer any concrete design for this).

BarakSason avatar Oct 25 '20 13:10 BarakSason

> Want to clarify a bit here. Jump consistent hash works really well for volumes that are expanding. For shrinking, one should always remove-brick from the last subvolume for better performance.

After thinking a bit about this, I don't see a perfect solution, but we could allow removal of intermediate subvolumes by combining two operations:

  1. Remove the last subvolume
  2. Replace the original subvolume to be removed with the newly removed subvolume.

This adds the cost of the second step, but it's still more efficient than our current approach. All other subvolumes remain unaffected (at most they receive some files).
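To make the cost difference concrete, here is a small standalone simulation (hypothetical code, not part of the attached patch) comparing removal of the last subvolume against removal of a middle subvolume where the remaining subvolumes are simply renumbered. The expectation is that roughly 1/N of the files move in the first case and well over half in the second:

```c
#include <stdint.h>
#include <stdio.h>

/* same function as sketched in the opening comment */
static int32_t
jump_consistent_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t)((b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t)b;
}

int
main(void)
{
    const int32_t n = 10;       /* shrink a 10-subvolume layout to 9 */
    const int32_t victim = 4;   /* arbitrary middle subvolume */
    int moved_last = 0, moved_middle = 0;

    for (uint64_t k = 1; k <= 100000; k++) {
        int32_t before = jump_consistent_hash(k, n);
        int32_t after = jump_consistent_hash(k, n - 1);

        /* (a) remove the last subvolume: only files whose placement
         * actually changes (i.e. files that lived on it) relocate */
        if (after != before)
            moved_last++;

        /* (b) remove a middle subvolume and renumber the ones above it:
         * a file moves if its subvolume disappeared or if its new
         * placement differs from where it already sits physically */
        int32_t current_pos = (before > victim) ? before - 1 : before;
        if (before == victim || after != current_pos)
            moved_middle++;
    }

    printf("remove last subvolume  : %d / 100000 files move\n", moved_last);
    printf("remove middle subvolume: %d / 100000 files move\n", moved_middle);
    return 0;
}
```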

xhernandez avatar Oct 26 '20 08:10 xhernandez

> I'm not sure if this could have a big impact in performance for creates and unlinks. An alternative is to scan the brick locally (much faster than scanning through the volume) and decide which files need to be moved. It doesn't prevent the scan, though it's much faster, but it doesn't need to keep an index either.

I was under the impression that dht does hash-mod, but it looks like it does a linear search in the layout to find the range the hash falls into. Anyway, I did a benchmark comparing hash-mod, jump consistent hash, and dht's range-search method. It prints the time each method takes to compute the hash and figure out the subvolume for 1 million files across 1000 subvolumes. Here are the results; jump consistent hash is ~7x faster than dht's current method:

  107748281ns Hash mod Duration
  171156118ns Jump Duration
 1278900379ns Dht Duration

GitHub doesn't support uploading C files, so I changed the extension to .txt.

jump-dht-compare.txt
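The attached file has the real harness; the shape of the comparison is roughly the following (hypothetical stand-in code: real dht hashes the file name via dht_hash_compute() and checks per-directory layout ranges, whereas here a multiplicative scramble of a counter stands in for the name hash):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NFILES   1000000
#define NSUBVOLS 1000

struct range { uint32_t start; uint32_t stop; };   /* dht-style hash range */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static int32_t jump_consistent_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t)((b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t)b;
}

int main(void)
{
    static struct range layout[NSUBVOLS];
    uint32_t step = UINT32_MAX / NSUBVOLS;
    volatile uint64_t sink = 0;   /* keep the optimizer honest */
    uint64_t t;

    for (int i = 0; i < NSUBVOLS; i++) {
        layout[i].start = i * step;
        layout[i].stop = (i == NSUBVOLS - 1) ? UINT32_MAX : (i + 1) * step - 1;
    }

    t = now_ns();   /* 1. plain modulo over the 32-bit hash */
    for (uint32_t f = 0; f < NFILES; f++)
        sink += (f * 2654435761u) % NSUBVOLS;
    printf("%12lluns Hash mod Duration\n", (unsigned long long)(now_ns() - t));

    t = now_ns();   /* 2. jump consistent hash */
    for (uint32_t f = 0; f < NFILES; f++)
        sink += jump_consistent_hash(f * 2654435761u, NSUBVOLS);
    printf("%12lluns Jump Duration\n", (unsigned long long)(now_ns() - t));

    t = now_ns();   /* 3. dht-style linear search for the matching range */
    for (uint32_t f = 0; f < NFILES; f++) {
        uint32_t h = f * 2654435761u;
        for (int i = 0; i < NSUBVOLS; i++) {
            if (h >= layout[i].start && h <= layout[i].stop) {
                sink += i;
                break;
            }
        }
    }
    printf("%12lluns Dht Duration\n", (unsigned long long)(now_ns() - t));

    return (int)(sink & 1);
}
```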

pranithk avatar Oct 27 '20 05:10 pranithk

@xhernandez I may get some time to do some more work in this regard for other hash functions you may want me to try out. Let me know. I will be happy to do the same work and provide comparisons.

pranithk avatar Oct 27 '20 06:10 pranithk

> @xhernandez I may get some time to do some more work in this regard for other hash functions you may want me to try out. Let me know. I will be happy to do the same work and provide comparisons.
>
> I'm not sure if this could have a big impact in performance for creates and unlinks. An alternative is to scan the brick locally (much faster than scanning through the volume) and decide which files need to be moved. It doesn't prevent the scan, though it's much faster, but it doesn't need to keep an index either.
>
> I was under the impression that dht does hash-mod, but it looks like it finds the range where the hash would fit in linearly in the layout. Anyways, I did a benchmark between this method, dht's search for range method. It prints the amount of time it would take to compute hash and figure out the subvolume for each method for 1 million files and 1000 subvolumes. Here are the results. Jump consistent hash is ~7X faster
>
>   107748281ns Hash mod Duration
>   171156118ns Jump Duration
>  1278900379ns Dht Duration
>
> github doesn't support uploading C files. So I changed extension to .txt
>
> jump-dht-compare.txt

The result is impressive.

mohit84 avatar Oct 27 '20 06:10 mohit84

> @pranithk First of all, these numbers look very good, and everything that reduces the number of files that are being migrated is welcomed. In the last couple of days I've been testing things that are related to the layout as well, and I got somewhat different numbers. I performed similar test to yours (expanding a 2 node volume to 3, 10,000 files on root) and the result was: Edit: attached as doc because of formatting rebalance_results.txt So I'm actually seeing much more files being migrated than you... any explanation?

No idea. Do you have a script which can recreate these results?

> As some of the previous comments mentioned GL, I'd like to make some clarifications regarding GL and the enhancements I'm working at in rebalance: 1 - As we've discussed, we have 2 problems with rebalance: 1.1 - Inner migration of file between existing nodes. 1.2 - Inefficiencies in both FS traversal and file migration. Both of these problems originates from the constraints of the per-dir layout model (the need to perform a fix layout operation on all the dirs). 2 - In order to solve 1.1, bucket-model needs to be designed & implemented and when nodes are added (or removed) buckets will only move to (or from) these node. In order to solve 1.2, I've already began work on #1685, #1684 and next up in line is #1660. @xhernandez regarding your comment on scanning the brick locally instead of through the volume, I'm planning to perform this as part of #1660 (there was another issue I opened recently - #1618 but closed it as #1660 contains this effort and is broader in it's scope). In order to fully complete this effort, substantial work is needed on the client side as well (implement GL on the client side), I haven't created issues for these yet. 3 - This is the important one - This work will NOT prevent the FS traversal (it will greatly optimize the process, but a FS scan will still be required). If a design that can eliminate the FS scan can be suggested, this is of course the optimal solution (my thinking was that once we progress with GL and the bucket model, maybe we could come up with some caching mechanism to keep track of which files belong to each bucket, thus eliminating the need to scan the FS, but at this point I'm not able to offer any concrete design for this).

At the moment I am mostly doing exploration to see what would best solve problem 1.1 you mentioned. Nothing is final yet. But I do want us to have good data to pick something better than the current mechanism.

pranithk avatar Oct 27 '20 06:10 pranithk

Looking at the code (specifically, at the dht_hash_compute() function), it is running a regex match under LOCK(&priv->lock) - which means it probably cannot process files as fast as you'd expect. I have no idea why a lock is needed there. I assume it is to ensure priv->extra_regex_valid, priv->extra_regex, priv->rsync_regex_valid and priv->rsync_regex do not change in between? That doesn't make sense to me. I don't know if there is contention on that lock, but I'm sure there's a better way to do it.

(not to mention that there's a gf_msg_trace() under that lock)
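One generic way to keep the regexec() out of the hot-path lock (purely illustrative, not the actual dht code; all names here are made up) is to protect the compiled pattern with a rwlock, so lookups only ever take a shared lock and reconfigure takes the exclusive lock just to swap the pattern:

```c
#include <pthread.h>
#include <regex.h>
#include <stdbool.h>

/* hypothetical stand-in for the regex fields kept in dht's private struct */
struct munge_conf {
    pthread_rwlock_t lock;      /* init with pthread_rwlock_init() */
    bool regex_valid;
    regex_t regex;
};

/* hot path: many concurrent readers, no exclusive lock, no tracing inside */
static bool
name_matches(struct munge_conf *conf, const char *name)
{
    bool match = false;

    pthread_rwlock_rdlock(&conf->lock);
    if (conf->regex_valid)
        match = (regexec(&conf->regex, name, 0, NULL, 0) == 0);
    pthread_rwlock_unlock(&conf->lock);

    return match;
}

/* reconfigure path: exclusive lock only while swapping the compiled regex */
static int
set_pattern(struct munge_conf *conf, const char *pattern)
{
    int ret = 0;

    pthread_rwlock_wrlock(&conf->lock);
    if (conf->regex_valid) {
        regfree(&conf->regex);
        conf->regex_valid = false;
    }
    if (pattern) {
        ret = regcomp(&conf->regex, pattern, REG_EXTENDED);
        conf->regex_valid = (ret == 0);
    }
    pthread_rwlock_unlock(&conf->lock);

    return ret;
}
```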

mykaul avatar Oct 27 '20 07:10 mykaul

> @pranithk First of all, these numbers look very good, and everything that reduces the number of files that are being migrated is welcomed. In the last couple of days I've been testing things that are related to the layout as well, and I got somewhat different numbers. I performed similar test to yours (expanding a 2 node volume to 3, 10,000 files on root) and the result was: Edit: attached as doc because of formatting rebalance_results.txt So I'm actually seeing much more files being migrated than you... any explanation?
>
> No idea. Do you have a script which can recreate these results?

Simplest way possible - a 2-node distributed volume, on the mountpoint touch {1..10000}, expand to 3 nodes, trigger rebalance. I've raised https://github.com/gluster/glusterfs/issues/1710 as a possible cause, but this still doesn't explain why we see different results. Could you take a look at that issue?

> At the moment I am mostly doing exploration to see what would best solve problem 1.1 you mentioned. Nothing is final yet. But I do want us to have good data to pick something better than the current mechanism.

I assume the bucket model should solve this? It'll solve the inner migration problem, however an FS scan will still be needed. While we're on this topic: in the bucket model, where will the bucket xattrs be saved? I assume that since the number of buckets is high, they can't be kept as an xattr the way the current hash range is. Any suggestions?

BarakSason avatar Oct 27 '20 07:10 BarakSason

> @pranithk First of all, these numbers look very good, and everything that reduces the number of files that are being migrated is welcomed. In the last couple of days I've been testing things that are related to the layout as well, and I got somewhat different numbers. I performed similar test to yours (expanding a 2 node volume to 3, 10,000 files on root) and the result was: Edit: attached as doc because of formatting rebalance_results.txt So I'm actually seeing much more files being migrated than you... any explanation?
>
> No idea. Do you have a script which can recreate these results?
>
> Simplest way possible - On mountpoint touch {1..10000}. I've raised #1710 as a possible cause, but this still doesn't explain why do we see different results. Could you take a look at this issue?

Okay. I will do this when I get a chance. May take a while.

> At the moment I am mostly doing exploration to see what would best solve problem 1.1 you mentioned. Nothing is final yet. But I do want us to have good data to pick something better than the current mechanism.
>
> I assume the bucket model should solve this? It'll solve the inner migration problem, however FS scan will still be needed. If we're talking about this, in the bucket model, where would the bucket xatts will be saved? I assume since the number of buckets is high, it can't be kept as an xattr, similar to the current hash-range. Any suggestions?

I haven't thought too much about this part yet. First I wanted to find out which hash algorithm is good for our case. Once that is finalized, I thought we could think about the buckets if necessary.

pranithk avatar Oct 27 '20 07:10 pranithk

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar May 30 '21 13:05 stale[bot]

I think this is still an interesting idea.

xhernandez avatar May 31 '21 06:05 xhernandez

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Dec 27 '21 14:12 stale[bot]

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

stale[bot] avatar Jan 23 '22 12:01 stale[bot]

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Sep 17 '23 06:09 stale[bot]