zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

ZOOKEEPER-4425: `snap` 4lw command enabling on demand snapshots

Open pfcoperez opened this issue 3 years ago • 8 comments

This PR contains the patch corresponding to the JIRA issue https://issues.apache.org/jira/browse/ZOOKEEPER-4425

It aims to:

  • Make the last in-memory copy of a node database to be accessible to 4lw commands, even after a disconnection event happens and the references to the connected Zookeeper server become null.
  • Use this internal feature to provide a snap for letter words command that makes the receiving ZooKeeper instance to save its local copy of the database to disk as any regular snapshot.

Example of usage:

echo "snap" | nc localhost 2184

pfcoperez avatar Dec 14 '21 16:12 pfcoperez

Thanks for the patch @pfcoperez !! A few generic comments:

Huge thanks @symat for taking the time to review the patch.

* this really helps only if you have a very specific problem, when you deleted all the snapshots and transaction log files by accident on all the ZooKeeper servers. I don't know if we should really support this edge case. (see the discussion on the Jira item)

I came with one example in the Jira issue but there are surely other situations where I find this useful. However, I'd perfectly understand if you disregarded this feature. Myself and other workmates would find it quite handy.

I've tested trying to restore trees using production data (quite big trees containing several GBs of data) snapshots, without translog files and I've not found problem in restoring at least partial data so far. I know it is a matter of chance but leaning probability to success certain disastrous situations is a huge step.

* I sense a potential vulnerability here. Taking a snapshot is heavy on the disk I/O. Also these files can be quite large. One could kill the node by sending "snap" commands with telnet in a loop.

On one hand, I had this though myself and that's why I think this should never be part of the default whitelisted commands. On the other, I would be quite worried about ZK installations with serving ports exposed to the public.

* since ZooKeeper 3.5 we have two diagnostic APIs: the old telnet 4-letter-words (we want to deprecate later I think) and the new http AdminAPI. In general we should make sure the new commands end up in both interfaces. (assuming we don't think this is a security risk... I'm not sure if we have any authentication on the REST admin API, and I would rather not let anyone creating snapshots without some authentication)

If you folks think this feature has any chance of eventually getting merged, I'd be delighted to add the REST API counterpart of this command. In fact I might do that anyway soon.

pfcoperez avatar Dec 16 '21 10:12 pfcoperez

I've added the HTTP admin API counterpart to the original 4lw snap command.

pfcoperez avatar Dec 21 '21 10:12 pfcoperez

@symat I've noticed there is not such a thing as a whitelist for the HTTP admin commands. I still think this admin port should not be exposed. Yet, I find it easy to implement some kind of exponential back-off to rate-limit the snap requests.

pfcoperez avatar Dec 22 '21 22:12 pfcoperez

Thanks a lot @symat ! I will do as requested.

I have a branch including exponential back-off for the HTTP version of the command and in fact that logic can be abstracted out as a decorator around any other command that might need it. Maybe you folks would be interested on that as a follow-up contribution too.

the new command will not be accessible through the 4-letter-words interface, unless it is explicitly whitelisted

That's intentional from my part I think it is a good idea to have it that way.

pfcoperez avatar Jan 06 '22 18:01 pfcoperez

Can you please update the doc page for this command?

eolivelli avatar Jan 07 '22 08:01 eolivelli

I have a concern about this change - tbh whether the feature should be supported at all. If you notice none of the existing 4lw allow changes to the "persistent" state of the service. This is on purpose. 4lw have no security - eg no authz. 4lw have been a source of security issues in the. past - in particular DOS attacks. eg https://issues.apache.org/jira/browse/ZOOKEEPER-2693 see also https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%224lw%20dos%22 Please reconsider this feature, at least in the current form. Thx.

phunt avatar Jan 07 '22 23:01 phunt

I have a concern about this change - tbh whether the feature should be supported at all. If you notice none of the existing 4lw allow changes to the "persistent" state of the service. This is on purpose. 4lw have no security - eg no authz. 4lw have been a source of security issues in the. past - in particular DOS attacks. eg https://issues.apache.org/jira/browse/ZOOKEEPER-2693 see also https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%224lw%20dos%22 Please reconsider this feature, at least in the current form. Thx.

Thanks a lot @phunt , would you be of the same opinion if the command was throttled or failed to executed when invoked in periods shorter than autopurge.purgeInterval ?

I've already an not-pushed commit that adds exponential backoff to this command. I am willing to refine (basically abstract out the backoff logic and extend it to the HTTP version of the coammnd) and push it if I know that'd change your opinion. I think simpler solutions, like just waiting for a fix interval like autopurge.purgeInterval or just allowing the execution of this command when the connection to the ensemble is lost, would work.

pfcoperez avatar Jan 14 '22 16:01 pfcoperez

@symat @phunt I am willing to work on adding some authentication capabilities to the AdminServer API then we could limit this feature to just allowing the command to be run through that API.

I've tried to find an open issue for that task without success: https://issues.apache.org/jira/browse/ZOOKEEPER-2693?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%22authentication%20admin%22

Would you mind pointing me to the existing issue? If you confirm it doesn't exist, I'll file it on Jira.

Thx!

pfcoperez avatar Feb 04 '22 11:02 pfcoperez

@eolivelli Do you know if there is any tracking issue on authenticated HTTP admin API? I'd like to add authentication if there isn't any and this way we could unlock this PR.

pfcoperez avatar Oct 24 '22 09:10 pfcoperez