curator CURATOR-688. SharedCount will be never updated successful when version of ZNode is overflow.

I met one issue which will never update znode value successfully when integer overflow (-2147483648) of znode data version using curator to invoke SharedCount#trySetCount(VersionedValue<Integer>, int). After dig the limitation logic and found that here could be the root cause.

https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209

My environment is, curator version: 2.10.0, zookeeper version: 3.4.6

Ref - CURATOR-688

Aug 31 '23 07:08 Hexiaoqiao

Sorry without new unit test here. Because I want to init one znode with version Integer.MAX_VALUE but no interface found. It will be helpful if someone give some guide to cover this case.

Aug 31 '23 07:08 Hexiaoqiao

Some conclusion, a. zookeeper server side only add 1 for version when update znode. And it allows integer overflow. b. curator side can't be compatible with integer overflow because it compares old version and new version here and not consider the negative version number. https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/shared/SharedValue.java#L196-L209 c. The issue will be triggered from user interface both SharedCount#trySetCount and SharedCount#start(). When znode version meet Integer.MIN_VALUE SharedCount will be never update because condition current.getVersion() >= version always true. d. While add condition current.getVersion() >= version && version != Integer.MIN_VALUE, this case could be resolved. And update znode as expected.

Aug 31 '23 09:08 Hexiaoqiao

One corner case didn't considered, and try to update this PR and trigger CI again.

Aug 31 '23 12:08 Hexiaoqiao

fix checkstyle and trigger ci again.

Sep 01 '23 02:09 Hexiaoqiao

Hi @eolivelli @kezhuw , anymore suggestions here?

Sep 01 '23 10:09 Hexiaoqiao

@Hexiaoqiao Sorry for the delay. I am stuck about and also fearing this overflow things. There are not simple bugs, there are limitations. Hard to fix, only fragile workaround for best wish. I have some chaos thoughts for this issue:

Curator guarantees no ordering-ness about VersionedValue.getVersion. That is good. Better to document this out of order issue or make it private in future.
The promise about "up-to-date" should be guaranteed.
Silent "never update" is not good which is the goal for us to fix here.

I am ok to a workaround but I am also think exception(e.g. let caller know they hit an implementation limitation) is a good fallback except for background watcher. Though, I am not positive to a good workaround. I think there are few choices when encountering this hard limitation.

Push foward underlying implementation. In this case, a check_zxid version of setData in ZooKeeper.
Rewrites something in caller side. In this case, either curator side or your application.
Workaround possibly fragile.

All these suggestions happened in https://lists.apache.org/thread/4o3rl49rdj5y0134df922zgc8clyt86s. cc @li4wang

Besides above, did you find this in production ? What is your use case ? @Hexiaoqiao

Sep 02 '23 14:09 kezhuw

@kezhuw Thanks for your detailed comments. I totally agree that this issue is not very easy/simple to fix perfectly. But for my case, this improvement could fix it when try to reproduce.

Besides above, did you find this in production ? What is your use case ?

Of course YES. The corresponding code snippet as following shows.

https://github.com/apache/hadoop/blob/a6c2526c6c7ccfc4483f58695aedf0cb892b4c4d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java#L506-L516

I would like to give a brief explanation (which is dependent by Hadoop project). A. For HDFS RBF architecture[1], all Routers share the same states which is managed by zookeeper, and Token is one of these states. B. About token[2], one Router generate token using increment sequence number shared by all Routers and update corresponding znode value (/zkdtsm/ZKDTSMRoot/ZKDTSMSeqNumRoot), let's name it Z. C. Where token number will be million/day or more, and the version of znode Z will overflow only years after create(maybe less than three years if 5 million/day tokens generated.)

At first, I want to fix it at Hadoop side, but it is not smoothy and could not fix the root cause.

For this PR, my first thought is fix at curator side first, then try to improve it at both zookeeper and curator side as the solution you mentioned above. Any thoughts? Thanks.

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html [2] http://hortonworks.com/wp-content/uploads/2011/08/adding_security_to_apache_hadoop.pdf

Sep 03 '23 15:09 Hexiaoqiao

@Hexiaoqiao Thanks your sharing, that is important!

I found some potential problem in the usages, though I haven't take a deep in look. incrSharedCount uses batch which is what Ted Dunning suggested in above mail thread. That is good. But it is somewhat error-prone in implementation.

The counter itself is a 32-bit integer, so it will overflow finally regard less of Stat.version.
The default batchSize is small. I got two defaults. public static final int DEFAULT_SEQ_NUM_BATCH_SIZE = 10; and public static final int ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT = 1;.

So, for your use case, I suggest:

Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.
Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Stat.version is 32-bit, it is not good to do a 1 to 1 mapping to application sequence number.

Sep 04 '23 00:09 kezhuw

@kezhuw Great response. Some concerns here.

Rewrites sequence number as 64-bit integer. I am not sure what the consequence of this. But 32-bit in your case will overflow finally. It could overflowed already hiding by overflow of Stat.version.

Uses a sensible default batch size and refuse small batch size. This is easy in your case, but you will get negative sequence number cause of above.

Actually sequence number can be compatible with 32-bit integer overflow, which only should be distinguishable number without other conditions. IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here. Another side, I totally agree that set the default configuration ZK_DTSM_TOKEN_SEQNUM_BATCH_SIZE_DEFAULT to some bigger number (such as 10 or much more) could bypass this issue, however it does not solve root cause about Curator. Right? Thanks again.

Sep 04 '23 02:09 Hexiaoqiao

IMO it is OK even if overflow, actually I try to test it on our test env and it works fine. So it is not hiding issue here.

@Hexiaoqiao Glad to hear, that is good!

however it does not solve root cause about Curator. Right?

I am open to a workaround as long as it fit. For a workaround to work, I think we should guarantee:

The promise about "up-to-date" should be guaranteed.
The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I feel hopeless about viable workaround, but not against one. Anyway please go ahead, I would be happy to hear good news.

Sep 04 '23 06:09 kezhuw

Thanks @kezhuw for your suggestions and sorry for the late response.

The promise about "up-to-date" should be guaranteed. The promise about "Changes the shared value only if its value has not changed since the version specified by newValue" should be hold. This means we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount.

I am not sure if I understood this point clearly. 'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

Sep 13 '23 13:09 Hexiaoqiao

'we have to throw exception if previous.getVersion() equals to -1 in call chain of trySetCount', Will it never update successfully when dataversion is back to -1 if we need to guarantee this rule?

Yes, and no. When previous.getVersion reach to -1, trySetCount has few choices:

Exception to caller for limitation of the implementation, that is Stat.version overflowed and wraparound to -1 and setData with -1 version is a blind update.
Keep silent and do a blind update.

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

For the "no" part, callers can risk themselves by doing a blind update through setCount.

Sep 13 '23 15:09 kezhuw

The later behavior is a bug due to the contract what trySetCount try to express. As a library, I don't think Curator should do that for callers silently.

+1, Agree. So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug, such as Token for Hadoop as mentioned above, which can update smoothly even blindly. What do you think about? TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

Sep 14 '23 03:09 Hexiaoqiao

TBH, I am not familiar with implement about Curator, so would you mind to list some demo code snippet about configuration if this solution could be approved.

Somehow you can copy the shared count implementation..It's not quite a lot of code.

Since this requirement is quite customized, I'm afraid that it's not suitable to hack into Curator. The root cause is integer overflow that is backed by ZK. (check_zxid may be complicated and ZK's zxid also suffers from integer overflow while that is a leader election issue; due to Jute's limitation, we cannot extend version from int to long or add a new field while keep compatibility.)

Also, I don't understand actually how this patch "fixes" the issue. You add more strict condition to exit but the original issue is hang?

Sep 15 '23 04:09 tisonkun

So how about expose the choice to end user and offer some configuration items to set? IMO, some sensitive data should not be updated blindly, but some case we should offer solution to bypass this bug

I am -0 to this approach. Exception is the default, in anyway. Clients can bypass themselves easily on exception if they encounter or aware of this. I don't think it is worth for Curator to do that for this limitation(either ZooKeepr or Curator implementation from perspective) in case of awareness from clients. The important things for Curator here from my side are documentation and exception in code. Anything beyond that are probably overkill.

Sep 15 '23 06:09 kezhuw

I understand the condition now and agree that this patch should work.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

Sep 15 '23 08:09 tisonkun

reproduce the problem

@eolivelli The tricky point here is that ZK doesn't expose API to manually edit node version so you should change the data for Int.MAX times which can be time-consuming..

Sep 15 '23 08:09 tisonkun

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

Good idea! I just saw ZOOKEEPER-4743.

Sep 15 '23 13:09 kezhuw

@tisonkun Glad to see you here.

While it can be better for ZK to jump from version -2 to 0 so that in an overflow loop we don't handle -1 hole, the restart and get versioned value = -1 remains and set data concurrently doesn't hurt as long as we read the updated data finally.

It will be more smooth while ZK to skip version -1, Strong +1. From my first thought, it would be the next step after we fix at Curator side. Now it is great to see that other guys already try to push this forward. For ZOOKEEPER-4743, I just suggest that we should consider to solve dataVersion/aclVersion/cversion 32-integer overflow at once. Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

Sep 18 '23 02:09 Hexiaoqiao

what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways?

I suggest we use Stat.mzxid to guard ordering in updateValue. (I posted in another thread https://github.com/apache/curator/pull/478/files#r1327946265).

The current approach does not sound correct to me(https://github.com/apache/curator/pull/478/files#r1313839760). And I think it is error-prone.

When Stat.version is -1, exception is enough for us to go. ZOOKEEPRER-4743 is another story, Curator should work irrespective of ZooKeeper version.

Sep 19 '23 13:09 kezhuw

Back to this PR, what should we do before ZOOKEEPRER-4743 check in, just wait ZOOKEEPER-4743 to be ready or other ways? Thanks.

They are irrelevant things.

For this patch, I may give another look as well as the comments above. @Hexiaoqiao you can try to address @kezhuw's comments/

Sep 20 '23 17:09 tisonkun

Hi @Hexiaoqiao, do you still work on this ? Or should I take over this ? I plan to resort to Stat.mzxid in updateValue.

Sep 27 '23 09:09 kezhuw

Sorry for the late response since I took a long vacation. Please feel free to take over it if interested.

Oct 09 '23 02:10 Hexiaoqiao

Thank you @Hexiaoqiao ! I have pushed a commit to using Stat.mzxid. Would you mind to take a look ?

Oct 12 '23 04:10 kezhuw

@Hexiaoqiao @kezhuw @eolivelli I'll review this patch in this week. I'd like to try include this patch in 5.7.0.

Jun 03 '24 10:06 tisonkun