foundationdb
FoundationDB loses all data after renaming datacenter_id and setting up regions
Problem statement
If the fdbserver datacenter_id differs from the one in use when the database was created, setting up regions deletes all data from the FoundationDB cluster.
Steps to reproduce
- Set up a three-node primary FDB cluster with the parameter datacenter_id = dummy.
- Create the database with `configure new ssd double`.
- Restart all fdbserver processes in the cluster with the parameter datacenter_id = dc1.
- Write a simple region description to the regions.json file:
{
"regions": [
{"datacenters": [{"id": "dc1", "priority": 6}]},
{"datacenters": [{"id": "dc2", "priority": -1}]}
]
}
- Configure regions in the database with fdbcli:
fileconfigure FORCE regions.json
The FORCE is necessary because there are no fdbserver processes in dc2 yet.
- Create and start another three FDB nodes with the parameter datacenter_id = dc2.
- Try to start replication from dc1 to dc2 with fdbcli:
configure usable_regions:=2
Expected result
The database should remain available, and the data should start replicating from dc1 to dc2.
Actual result
WARNING: Long delay (Ctrl-C to interrupt)
The database is unavailable; type `status' for more information.
Findings
After the attempt to change usable_regions, all the storage files of all fdbserver processes in dc1 are deleted and no longer exist.
Despite restarting fdbserver processes with the new datacenter_id = dc1, the old datacenter_id is still present in the system space of the database and reserves the locality tag 0.
fdb> getrangekeys \xff/tagLocalityList \xff/tagLocalityList0
Range limited to 25 keys
`\xff/tagLocalityList/\x01\x03\x00\x00\x00dummy'
fdb> get \xff/tagLocalityList/\x01\x03\x00\x00\x00dummy
`\xff/tagLocalityList/\x01\x03\x00\x00\x00dummy' is `\x01\x00\x01b\xb0\x00\xdb\x0f\x00'
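The stored value can be decoded by hand to confirm which locality tag the old entry holds. The sketch below assumes the value is laid out as an 8-byte little-endian version prefix followed by one signed byte holding the locality. That layout is inferred from the bytes printed above rather than taken from the FoundationDB source, and `decode_tag_locality_value` is a hypothetical helper name.

```python
import struct


def decode_tag_locality_value(raw: bytes) -> tuple[int, int]:
    """Split a tagLocalityList value into (version_prefix, locality).

    Assumed layout (not confirmed against the FDB source):
    8-byte little-endian unsigned prefix, then one signed locality byte.
    """
    if len(raw) != 9:
        raise ValueError(f"expected 9 bytes, got {len(raw)}")
    version, locality = struct.unpack("<Qb", raw)
    return version, locality


# Value printed by fdbcli for the `dummy` entry above.
raw_value = b"\x01\x00\x01b\xb0\x00\xdb\x0f\x00"
version, locality = decode_tag_locality_value(raw_value)
print(hex(version), locality)
```

Under that assumption the final byte decodes to locality 0, which matches the tag that the old datacenter_id still reserves.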
The new datacenter_id values are dynamically assigned the next locality tags, 1 and 2 (recruitEverything->newTLogServers in fdbserver/masterserver.actor.cpp). The data loss then happens as follows:
- storageServerCore receives dbInfoChange
- storageServerCore calls TagPartitionedLogSystem::peekSingle
- TagPartitionedLogSystem::peekSingle calls TagPartitionedLogSystem::peekLocal
- TagPartitionedLogSystem::peekLocal scans the tLogs for a log with locality 0 (taken from the storage tag), but only logs with localities 1 and 2 exist.
- In this case TagPartitionedLogSystem::peekLocal traces the event TLogPeekLocalNoBestSet and throws worker_removed()
- storageServerCore catches the worker_removed exception and calls storageServerTerminated
- storageServerTerminated removes all data files, which makes the database unavailable.
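The chain above can be illustrated with a toy model. This is not FoundationDB code: the function names only mirror the ones listed above, `WorkerRemoved` stands in for worker_removed(), and the file names are made up for the example.

```python
class WorkerRemoved(Exception):
    """Stand-in for the worker_removed() error named above."""


def peek_local(tlog_localities, storage_locality):
    """Return the matching log locality, or raise if no log has it."""
    for loc in tlog_localities:
        if loc == storage_locality:
            return loc
    # Corresponds to the TLogPeekLocalNoBestSet case described above.
    raise WorkerRemoved(f"no log set with locality {storage_locality}")


def storage_server_core(tlog_localities, storage_locality, data_files):
    """Return the data files that survive one dbInfo change."""
    try:
        peek_local(tlog_localities, storage_locality)
        return list(data_files)
    except WorkerRemoved:
        # Models storageServerTerminated removing the data files.
        return []


files = ["storage-1.sqlite", "storage-1.sqlite-wal"]  # hypothetical names
print(storage_server_core([1, 2], 0, files))  # storage tag still has locality 0
print(storage_server_core([0, 2], 0, files))  # a log with locality 0 exists
```

In the first call no log carries locality 0, so the model returns an empty file list. In the second call the files are kept.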
Proposal
- The best solution would be to manage the locality dictionary in the system space automatically and register the new datacenter_id with the same locality tag.
- Another acceptable solution is to add checks to configure and fileconfigure that all localities registered in \xff/tagLocalityList are present in the region configuration.
- Maybe add a check when a storage server starts that its datacenter_id is registered in \xff/tagLocalityList, and refuse to start with a clear error message if it is not.
- The minimum is to state in the documentation that changing datacenter_id is strictly forbidden after the database has been created.
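The check proposed for configure and fileconfigure could look roughly like the sketch below. It assumes the registered datacenter IDs have already been read from \xff/tagLocalityList by some other means. `missing_datacenters` is a hypothetical helper, not part of fdbcli.

```python
import json


def missing_datacenters(registered_ids, regions_json):
    """Return registered datacenter IDs absent from the region configuration."""
    config = json.loads(regions_json)
    configured = {
        dc["id"]
        for region in config.get("regions", [])
        for dc in region.get("datacenters", [])
    }
    return sorted(set(registered_ids) - configured)


# The regions.json content from the steps to reproduce.
regions = """
{
"regions": [
{"datacenters": [{"id": "dc1", "priority": 6}]},
{"datacenters": [{"id": "dc2", "priority": -1}]}
]
}
"""

# `dummy` is still registered but no region mentions it, so it is reported.
print(missing_datacenters(["dummy"], regions))
```

A non-empty result would be the signal to reject the new configuration instead of applying it.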
Any update on this issue? Is it a bug, a feature, or incorrect use of FDB? We ran into the same situation when we tried to convert an FDB cluster to a multi-region configuration.
Is there any documentation on how to safely convert a single-region FDB cluster to a multi-region configuration?
@wangzw The only suggestion is to use dc1 as the primary datacenter name and not to try to rename it.
Do you mean a hard-coded "dc1"?
datacenter_id is commented out in foundationdb.conf when we deploy the first single-region cluster:
# datacenter_id =
Now we want to reconfigure it as a multi-region cluster. What should we do with datacenter_id?
@wangzw Yes. I do.
"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.
Thanks for your fast response.
It doesn't seem to work.
@wangzw Yes. I do.
"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.
It doesn't seem to work.
Situation 1: if the datacenter_id parameter is commented out in foundationdb.conf when we deploy the first single-region cluster:
fdb> getrangekeys \xff/tagLocalityList
Range limited to 25 keys
\xff/tagLocalityList/\x01\x03\x00\x00\x00
Situation 2: if datacenter_id = dc1 is defined in foundationdb.conf when we deploy the first single-region cluster:
fdb> getrangekeys \xff/tagLocalityList
Range limited to 25 keys
\xff/tagLocalityList/\x01\x03\x00\x00\x00dc1
Situation 1 causes the database to become unavailable when usable_regions is set to 2.
@oleg68
@wangzw Yes. I do.
"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.
Hi @oleg68
After deploying a cluster with datacenter_id commented out, I tried changing datacenter_id to dc1 in the configuration file. All data was removed and the database became unavailable.
As @clindydeng's comment above shows, it seems that datacenter_id has no default value.
Can we identify this issue as a bug? Any hints on how to fix it?
Hi @oleg68 @xumengpanda
I manually replaced the system key \xff/tagLocalityList/\x00 with \xff/tagLocalityList/\x01\x03\x00\x00\x00dc1, keeping the same value. Everything has worked well since.
Is it OK to modify a system key like this?