skip_unavailable changes from true to false when remote connection fails
Elasticsearch Version
8.12.2
Installed Plugins
No response
Java Version
bundled
OS Version
linux
Problem Description
When a remote cluster connection is configured with an incorrect address, Elasticsearch automatically resets that remote's skip_unavailable setting to false. After the connection details are corrected, skip_unavailable does not revert to true, even if it was previously set to that value. It has to be explicitly reconfigured to set it back to true.
Steps to Reproduce
- Configure a remote cluster with skip_unavailable set to true:
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.mode": "proxy",
    "cluster.remote.ccs.proxy_address": "ccs.es.us-central1.gcp.cloud.es.io:9400",
    "cluster.remote.ccs.proxy_socket_connections": "18",
    "cluster.remote.ccs.server_name": "ccs.es.us-central1.gcp.cloud.es.ioo",
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}
- Verify the configuration and note that skip_unavailable is true.
- Introduce an error by setting an incorrect remote cluster address:
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.proxy_address": "ccs-broken.es.us-central1.gcp.cloud.es.io:9400"
  }
}
- Observe that the remote connection fails and skip_unavailable is automatically set to false:
{
  "ccs": {
    "connected": false,
    "mode": "proxy",
    "proxy_address": "ccs-broken.es.us-central1.gcp.cloud.es.io:9400",
    "server_name": "ccs-broken.es.us-central1.gcp.cloud.es.ioo",
    "num_proxy_sockets_connected": 0,
    "max_proxy_socket_connections": 18,
    "initial_connect_timeout": "30s",
    "skip_unavailable": false
  }
}
- Correct the server address back to the initial correct value.
- Notice that skip_unavailable remains false and does not revert to true:
{
  "ccs": {
    "connected": true,
    "mode": "proxy",
    "proxy_address": "ccs.es.us-central1.gcp.cloud.es.io:9400",
    "server_name": "ccs.es.us-central1.gcp.cloud.es.ioo",
    "num_proxy_sockets_connected": 18,
    "max_proxy_socket_connections": 18,
    "initial_connect_timeout": "30s",
    "skip_unavailable": false
  }
}
- Manually attempt to set skip_unavailable to true again:
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}
- Observe that skip_unavailable does not change to true and remains set to false.
- Set skip_unavailable to false while it is already set to false:
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "false"
  }
}
- Manually attempt to set skip_unavailable to true:
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.ccs.skip_unavailable": "true"
  }
}
- The setting now updates successfully. Verify that the remote connection works and skip_unavailable is set back to true.
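For anyone re-running these steps, a small Python helper can make the check after each step less error-prone. This is only a sketch, not part of the original report: it assumes the remote info shown above comes from GET _remote/info, that the persisted value is read with GET _cluster/settings?flat_settings=true, and that the cluster is reachable without authentication at a base URL you supply. The function names are made up for illustration. The report does not say whether the persisted setting itself diverges from the reported value, so the helper simply prints both and flags a mismatch if there is one.

```python
import json
import urllib.request


def fetch_json(base_url: str, path: str) -> dict:
    """GET a JSON document from the cluster (assumes no auth; adjust for a secured cluster)."""
    with urllib.request.urlopen(f"{base_url}{path}") as resp:
        return json.load(resp)


def effective_skip_unavailable(remote_info: dict, alias: str) -> bool:
    """Value reported in a GET _remote/info response for the given remote cluster alias."""
    return bool(remote_info[alias]["skip_unavailable"])


def persisted_skip_unavailable(flat_settings: dict, alias: str):
    """Value stored under persistent settings (flat_settings=true form), or None if unset."""
    key = f"cluster.remote.{alias}.skip_unavailable"
    raw = flat_settings.get("persistent", {}).get(key)
    return None if raw is None else str(raw).lower() == "true"


def describe(remote_info: dict, flat_settings: dict, alias: str) -> str:
    """One-line summary that flags a difference between persisted and reported values."""
    effective = effective_skip_unavailable(remote_info, alias)
    persisted = persisted_skip_unavailable(flat_settings, alias)
    if persisted is not None and persisted != effective:
        return f"{alias}: MISMATCH persisted={persisted} effective={effective}"
    return f"{alias}: persisted={persisted} effective={effective}"


# Example usage against a live cluster (hypothetical URL):
#   base = "http://localhost:9200"
#   info = fetch_json(base, "/_remote/info")
#   settings = fetch_json(base, "/_cluster/settings?flat_settings=true")
#   print(describe(info, settings, "ccs"))
```

Running describe after each step of the reproduction gives one comparable line per step, which should make it easier to tell at which point the reported value flips to false.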
Logs (if relevant)
No response
Pinging @elastic/es-distributed (Team:Distributed)
I believe the problem described above should be fixed by #105792. This PR changes the default behaviour for skip_unavailable to true. It does not address steps 10 to 14, where skip_unavailable has to be set to false and then to true, which seems to be a different issue.
The original problem statement should now be resolved in 8.15. Can you confirm, please, @asmith-elastic?
@mhl-b thanks for checking! While the mentioned PR will change the default value to true, we want to be sure that the issue described here won't change the value back to false if the remote connection fails. If that happens and goes unnoticed, users will have skip_unavailable set to false on the remote clusters that failed, which is not the right default experience and is why we are introducing the changes in 8.15.
@naj-h the PR I attached reproduces the steps in the description and demonstrates that the problem no longer exists in the current codebase. Are you satisfied that we can close this issue?
@nicktindall Thanks very much for your tests! If this issue does not reproduce in main, then I think we can close this out.