zookeeper
zookeeper copied to clipboard
ZOOKEEPER-4425: `snap` 4lw command enabling on demand snapshots
This PR contains the patch corresponding to the JIRA issue https://issues.apache.org/jira/browse/ZOOKEEPER-4425
It aims to:
- Make the last in-memory copy of a node database to be accessible to 4lw commands, even after a disconnection event happens and the references to the connected Zookeeper server become
null
. - Use this internal feature to provide a
snap
for letter words command that makes the receiving ZooKeeper instance to save its local copy of the database to disk as any regular snapshot.
Example of usage:
echo "snap" | nc localhost 2184
Thanks for the patch @pfcoperez !! A few generic comments:
Huge thanks @symat for taking the time to review the patch.
* this really helps only if you have a very specific problem, when you deleted all the snapshots and transaction log files by accident on all the ZooKeeper servers. I don't know if we should really support this edge case. (see the discussion on the Jira item)
I came with one example in the Jira issue but there are surely other situations where I find this useful. However, I'd perfectly understand if you disregarded this feature. Myself and other workmates would find it quite handy.
I've tested trying to restore trees using production data (quite big trees containing several GBs of data) snapshots, without translog files and I've not found problem in restoring at least partial data so far. I know it is a matter of chance but leaning probability to success certain disastrous situations is a huge step.
* I sense a potential vulnerability here. Taking a snapshot is heavy on the disk I/O. Also these files can be quite large. One could kill the node by sending "snap" commands with telnet in a loop.
On one hand, I had this though myself and that's why I think this should never be part of the default whitelisted commands. On the other, I would be quite worried about ZK installations with serving ports exposed to the public.
* since ZooKeeper 3.5 we have two diagnostic APIs: the old telnet 4-letter-words (we want to deprecate later I think) and the new http AdminAPI. In general we should make sure the new commands end up in both interfaces. (assuming we don't think this is a security risk... I'm not sure if we have any authentication on the REST admin API, and I would rather not let anyone creating snapshots without some authentication)
If you folks think this feature has any chance of eventually getting merged, I'd be delighted to add the REST API counterpart of this command. In fact I might do that anyway soon.
I've added the HTTP admin API counterpart to the original 4lw snap
command.
@symat I've noticed there is not such a thing as a whitelist for the HTTP admin commands. I still think this admin port should not be exposed. Yet, I find it easy to implement some kind of exponential back-off to rate-limit the snap
requests.
Thanks a lot @symat ! I will do as requested.
I have a branch including exponential back-off for the HTTP version of the command and in fact that logic can be abstracted out as a decorator around any other command that might need it. Maybe you folks would be interested on that as a follow-up contribution too.
the new command will not be accessible through the 4-letter-words interface, unless it is explicitly whitelisted
That's intentional from my part I think it is a good idea to have it that way.
Can you please update the doc page for this command?
I have a concern about this change - tbh whether the feature should be supported at all. If you notice none of the existing 4lw allow changes to the "persistent" state of the service. This is on purpose. 4lw have no security - eg no authz. 4lw have been a source of security issues in the. past - in particular DOS attacks. eg https://issues.apache.org/jira/browse/ZOOKEEPER-2693 see also https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%224lw%20dos%22 Please reconsider this feature, at least in the current form. Thx.
I have a concern about this change - tbh whether the feature should be supported at all. If you notice none of the existing 4lw allow changes to the "persistent" state of the service. This is on purpose. 4lw have no security - eg no authz. 4lw have been a source of security issues in the. past - in particular DOS attacks. eg https://issues.apache.org/jira/browse/ZOOKEEPER-2693 see also https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%224lw%20dos%22 Please reconsider this feature, at least in the current form. Thx.
Thanks a lot @phunt , would you be of the same opinion if the command was throttled or failed to executed when invoked in periods shorter than autopurge.purgeInterval
?
I've already an not-pushed commit that adds exponential backoff to this command. I am willing to refine (basically abstract out the backoff logic and extend it to the HTTP version of the coammnd) and push it if I know that'd change your opinion. I think simpler solutions, like just waiting for a fix interval like autopurge.purgeInterval
or just allowing the execution of this command when the connection to the ensemble is lost, would work.
@symat @phunt I am willing to work on adding some authentication capabilities to the AdminServer API then we could limit this feature to just allowing the command to be run through that API.
I've tried to find an open issue for that task without success: https://issues.apache.org/jira/browse/ZOOKEEPER-2693?jql=project%20%3D%20ZOOKEEPER%20AND%20text%20~%20%22authentication%20admin%22
Would you mind pointing me to the existing issue? If you confirm it doesn't exist, I'll file it on Jira.
Thx!
@eolivelli Do you know if there is any tracking issue on authenticated HTTP admin API? I'd like to add authentication if there isn't any and this way we could unlock this PR.