FoundationDB loses all data after renaming datacenter_id and setting up regions

Open oleg68 opened this issue 4 years ago • 9 comments

Problem statement

If the fdbserver datacenter_id differs from the one in use when the database was created, setting up regions deletes all data from the FoundationDB cluster.

Steps to reproduce

  1. Set up a three-node primary FDB cluster with the datacenter_id = dummy parameter
  2. Create the database with `configure new ssd double`
  3. Restart all fdbserver processes in the cluster with the datacenter_id = dc1 parameter
  4. Write a simple region description to the regions.json file:
{
  "regions": [
    {"datacenters": [{"id": "dc1", "priority": 6}]}, 
    {"datacenters": [{"id": "dc2", "priority": -1}]}
  ]
}
  5. Configure regions in the database with fdbcli: fileconfigure FORCE regions.json. FORCE is necessary because there are no fdbserver processes in dc2 yet
  6. Create and start another three FDB nodes with the datacenter_id = dc2 parameter
  7. Try to start replication from dc1 to dc2 with fdbcli: configure usable_regions:=2
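
As an illustration of step 4, the region description can also be generated rather than typed by hand. This is a minimal sketch: the helper name `build_regions` is hypothetical, and it only covers the one-datacenter-per-region layout used in this report.

```python
import json


def build_regions(datacenters):
    """Build a regions document with one single-datacenter region per (id, priority) pair."""
    return {
        "regions": [
            {"datacenters": [{"id": dc_id, "priority": priority}]}
            for dc_id, priority in datacenters
        ]
    }


# Same layout as the regions.json shown in the steps above.
regions = build_regions([("dc1", 6), ("dc2", -1)])
regions_json = json.dumps(regions, indent=2)
print(regions_json)
```

The printed text can be saved as regions.json and passed to fileconfigure as in step 5.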

Expected result

The database should remain available. The data should start replicating from dc1 to dc2

Actual result

WARNING: Long delay (Ctrl-C to interrupt)

The database is unavailable; type `status' for more information.

Findings

After trying to change usable_regions, all storage files of all fdbserver processes in dc1 are deleted and no longer exist.

Despite restarting fdbserver processes with the new datacenter_id = dc1, the old datacenter_id is still present in the system space of the database and reserves the locality tag 0.

fdb> getrangekeys \xff/tagLocalityList \xff/tagLocalityList0

Range limited to 25 keys
`\xff/tagLocalityList/\x01\x03\x00\x00\x00dummy'

fdb> get \xff/tagLocalityList/\x01\x03\x00\x00\x00dummy
`\xff/tagLocalityList/\x01\x03\x00\x00\x00dummy' is `\x01\x00\x01b\xb0\x00\xdb\x0f\x00'
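
To see where "locality tag 0" comes from, the value above can be decoded by hand. The sketch below assumes the value is an 8-byte little-endian protocol version followed by one signed byte holding the locality; that layout is an assumption on my part and is not confirmed anywhere in this issue.

```python
import struct

# Value bytes copied from the fdbcli output above.
value = b"\x01\x00\x01b\xb0\x00\xdb\x0f\x00"


def decode_tag_locality_value(raw):
    """Split the value into (protocol_version, locality).

    Assumed layout (not confirmed by the issue): 8-byte little-endian
    protocol version, then a signed 8-bit locality.
    """
    if len(raw) != 9:
        raise ValueError("expected 9 bytes, got %d" % len(raw))
    return struct.unpack("<Qb", raw)


protocol_version, locality = decode_tag_locality_value(value)
print(hex(protocol_version), locality)
```

Under that assumption the trailing byte decodes to 0, which matches the locality tag 0 reserved for the old datacenter_id.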

The new datacenter_id values were dynamically assigned the next locality tags, 1 and 2 (recruitEverything->newTLogServers in fdbserver/masterserver.actor.cpp). The failure then unfolds as follows:

  1. storageServerCore receives dbInfoChange
  2. storageServerCore calls TagPartitionedLogSystem::peekSingle
  3. TagPartitionedLogSystem::peekSingle calls TagPartitionedLogSystem::peekLocal
  4. TagPartitionedLogSystem::peekLocal scans tLogs for a log with locality 0 (taken from the storage tag). But there are only logs with localities 1 and 2, and none with locality 0.
  5. In this case TagPartitionedLogSystem::peekLocal traces the event TLogPeekLocalNoBestSet and throws worker_removed()
  6. storageServerCore catches the worker_removed exception and calls storageServerTerminated
  7. storageServerTerminated removes all data files, which makes the database unavailable.
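
The chain above can be modelled in a few lines. This is a deliberately simplified toy, not the actual fdbserver code: the function names mirror the ones in the list, but their signatures, the `WorkerRemoved` class and the `data_files` list are all hypothetical stand-ins.

```python
class WorkerRemoved(Exception):
    """Stand-in for the worker_removed() error named in the list above."""


def peek_local(log_localities, tag_locality):
    """Return the matching log locality, or raise if no log serves the tag."""
    for locality in log_localities:
        if locality == tag_locality:
            return locality
    # Corresponds to the TLogPeekLocalNoBestSet case: no log set matches.
    raise WorkerRemoved("no log with locality %d" % tag_locality)


def storage_server_core(log_localities, tag_locality, data_files):
    """Model of steps 1-7: a failed peek ends with the data files removed."""
    try:
        peek_local(log_localities, tag_locality)
        return "running"
    except WorkerRemoved:
        data_files.clear()  # models storageServerTerminated deleting the files
        return "terminated"


files = ["storage-a.sqlite", "storage-b.sqlite"]  # hypothetical file names
state = storage_server_core([1, 2], 0, files)
print(state, files)
```

With logs at localities 1 and 2 and a storage tag at locality 0, the model ends in the terminated state with an empty file list, which is the outcome described in the findings.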

Proposal

  • The best solution would be to automatically manage the locality dictionary in the system space and register the new datacenter_id with the same locality tag.
  • Another acceptable solution is to add checks to configure and fileconfigure that all localities registered in \xff/tagLocalityList are present in the region configuration
  • Maybe add a check when a storage server starts that its datacenter_id is registered in \xff/tagLocalityList, and refuse to start with a clear error message otherwise
  • The minimum is to mention in the documentation that changing datacenter_id is strictly forbidden after the database has been created
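
The second proposal could look roughly like the check below. It is only a sketch of the idea: `unregistered_localities` is a hypothetical helper, and the list of registered ids is assumed to have been read from \xff/tagLocalityList beforehand by some other means.

```python
def unregistered_localities(registered_dc_ids, regions_doc):
    """Return registered datacenter ids that the region configuration does not mention."""
    configured = {
        dc["id"]
        for region in regions_doc.get("regions", [])
        for dc in region.get("datacenters", [])
    }
    return sorted(set(registered_dc_ids) - configured)


regions_doc = {
    "regions": [
        {"datacenters": [{"id": "dc1", "priority": 6}]},
        {"datacenters": [{"id": "dc2", "priority": -1}]},
    ]
}

# "dummy" is the id registered when the database was created in this report.
missing = unregistered_localities(["dummy"], regions_doc)
if missing:
    print("refusing to configure regions; unknown localities:", missing)
```

In the scenario from this report the check would flag dummy, so the region change could be rejected before any storage server is affected.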

oleg68 avatar Nov 19 '20 16:11 oleg68

Any update on this issue? Is it a bug, a feature, or incorrect use of FDB? We ran into the same situation when we tried to convert an FDB cluster to a multi-region configuration.

Is there any documentation on how to safely convert a single-region FDB cluster to a multi-region configuration?

wangzw avatar Jun 22 '22 11:06 wangzw

@wangzw The only suggestion is to use dc1 as the primary datacenter name and not to try to rename it.

oleg68 avatar Jun 22 '22 12:06 oleg68

Do you mean hard coded "dc1"?

datacenter_id was commented out in foundationdb.conf when we deployed the first single-region cluster.

# datacenter_id =

Now we want to reconfigure it as a multi-region cluster. What should we do with datacenter_id?

wangzw avatar Jun 22 '22 12:06 wangzw

@wangzw Yes. I do.

"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.

oleg68 avatar Jun 22 '22 12:06 oleg68

Thanks for your fast response.

wangzw avatar Jun 22 '22 12:06 wangzw

@wangzw Yes. I do.

"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.

This does not seem to work.

Situation 1: if datacenter_id is commented out in foundationdb.conf when we deploy the first single-region cluster:

fdb> getrangekeys \xff/tagLocalityList
Range limited to 25 keys
\xff/tagLocalityList/\x01\x03\x00\x00\x00

Situation 2: if datacenter_id = dc1 is defined in foundationdb.conf when we deploy the first single-region cluster:

fdb> getrangekeys \xff/tagLocalityList
Range limited to 25 keys
\xff/tagLocalityList/\x01\x03\x00\x00\x00dc1

Situation 1 causes the database to become unavailable when usable_regions is set to 2.

clindydeng avatar Jun 30 '22 03:06 clindydeng

@oleg68

clindydeng avatar Jun 30 '22 10:06 clindydeng

@wangzw Yes. I do.

"dc1" is a hardcoded default value and you must not change it when converting to a multi-regional configuration.

Hi @oleg68

After deploying a cluster with datacenter_id commented out, I tried changing datacenter_id to dc1 in the configuration file; all data was removed and the database became unavailable.

As @clindydeng's comment above shows, it seems that datacenter_id has no default value.

Can we identify this issue as a bug? Any hints on how to fix it?

wangzw avatar Jul 04 '22 02:07 wangzw

Hi @oleg68 @xumengpanda

I manually replaced the system key \xff/tagLocalityList/\x00 with \xff/tagLocalityList/\x01\x03\x00\x00\x00dc1, keeping the same value. Everything is working well.

Is it OK to modify system keys like this?

wangzw avatar Jul 15 '22 02:07 wangzw