Confusing behavior of no-wait exclude when a process is not reporting to the cluster
When running the exclude no_wait command with a process that is not reporting to the database, the CLI reports a message of the form: "WARNING: Missing from cluster! Be sure that you excluded the correct processes before removing them from the cluster!" It reports the same message whether the address is completely unknown to the database or belongs to a process that still holds data that has not been fully re-replicated. This means that we cannot use the output of exclude no_wait to determine whether re-replication for that process has completed, which makes it difficult to decide if it is safe to permanently destroy resources associated with a process that is temporarily unavailable. By comparison, the blocking form of the exclude command will block when a process is in this state, until the data is replicated. I think we should change this behavior to give a clearer signal for processes that are missing but have data, and to bring the no-wait exclude and the blocking exclude closer together.
What's your preferred way to get the signal out? Is it some text or error code from the command line?
I already added a special key range (\xff\xff/management/in_progress_exclusion/, \xff\xff/management/in_progress_exclusion0) that reports which processes have an exclusion still in progress, meaning their data replication has not finished.
It's trivial to add a new fdbcli command for this, something like excludeInProgress, to print out any processes that have not finished yet. Would that be helpful here?
That special key-range will be available in 7.0 (I only see it in the release-7.0 branch and not the release-6.3)? Would it make more sense to read it directly from the database instead of adding a new fdbcli command?
Yeah, it's only available in 7.0, and yes, we can read it directly. Adding a command would just make it easier to use (and could print more help text) for anyone who cannot remember the key range.
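For reference, reading the range directly could look roughly like the Python sketch below. It is only an illustration under some assumptions: that each key in the range is the prefix followed by the process address, that a plain range read on the special key space is enough, and that the Python bindings at API version 700 are installed. The helper names in_progress_exclusions and read_from_cluster are hypothetical, not part of any FoundationDB API.

```python
# Bounds of the special key range quoted earlier in this thread.
PREFIX = b"\xff\xff/management/in_progress_exclusion/"
END = b"\xff\xff/management/in_progress_exclusion0"


def in_progress_exclusions(tr):
    """Return the addresses whose exclusion is still in progress.

    `tr` is any object with get_range(begin, end) that yields items
    exposing a bytes `.key`, such as an fdb transaction.
    Assumption: each key is PREFIX followed by the process address.
    """
    return [
        kv.key[len(PREFIX):].decode("ascii")
        for kv in tr.get_range(PREFIX, END)
    ]


def read_from_cluster(cluster_file=None):
    """Hypothetical helper: run the read against a live 7.0+ cluster."""
    import fdb  # FoundationDB Python bindings, imported lazily

    fdb.api_version(700)
    db = fdb.open(cluster_file)
    tr = db.create_transaction()
    return in_progress_exclusions(tr)
```

An empty result would only mean that no exclusion is reported as in progress; it does not by itself confirm that a given address was ever excluded, so it would need to be combined with a check of the excluded list before destroying anything.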