
'Replication health' continues to be in the '(Re)initializing automatic data distribution' phase

Open Felix-zhoux opened this issue 2 years ago • 4 comments

FDB version: 7.2.0. Cluster size: 18 nodes in total, 6 nodes per DC. Redundancy mode: three_datacenter.

Here is what I see in fdbcli:

# fdbcli 
Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is available, but has issues (type 'status' for more information).

Welcome to the fdbcli. For help, type `help'.
fdb> 
fdb> 
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - three_datacenter
  Storage engine         - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 7
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 502
  Zones                  - 502
  Machines               - 18
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 78 Hz
  Fault Tolerance        - 3 zones
  Server time            - 01/16/23 10:34:52

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 50.634 GB

Operating space:
  Storage server         - 1697.8 GB free on most full server
  Log server             - 1679.8 GB free on most full server

Workload:
  Read rate              - 303 Hz
  Write rate             - 87 Hz
  Transactions started   - 104 Hz
  Transactions committed - 87 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 01/16/23 10:34:52

fdb> 

The data_distributor process log has the following error:

<Event Severity="20" Time="1673429958.603785" DateTime="2023-01-11T09:39:18Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1673429960.710971" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7faca699d630 0x4680045 0x1794715 0x174c456 0x1765979 0x176b7ac 0x17c6706 0x17c6b73 0x17c6c5b 0x17ce0e5 0x15a3040 0x438bb32 0xc82397 0x7faca65e2555 0xd01412" ThreadID="8092011216532190504" Machine="10.181.159.65:7500" LogGroup="default" Roles="DD" />

# addr2line -e fdbserver.debug -p -C -f -i 0x7faca699d630 0x4680045 0x1794715 0x174c456 0x1765979 0x176b7ac 0x17c6706 0x17c6b73 0x17c6c5b 0x17ce0e5 0x15a3040 0x438bb32 0xc82397 0x7faca65e2555 0xd01412
?? ??:0
free_fastpath at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/Jemalloc_project-prefix/src/Jemalloc_project/src/jemalloc.c:3085
 (inlined by) free at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/Jemalloc_project-prefix/src/Jemalloc_project/src/jemalloc.c:3161
__gnu_cxx::new_allocator<std::_Rb_tree_node<std::string> >::deallocate(std::_Rb_tree_node<std::string>*, unsigned long) at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:125
 (inlined by) std::allocator_traits<std::allocator<std::_Rb_tree_node<std::string> > >::deallocate(std::allocator<std::_Rb_tree_node<std::string> >&, std::_Rb_tree_node<std::string>*, unsigned long) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h:462
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_put_node(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:603
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_drop_node(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:670
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_erase(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:1874
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) void __gnu_cxx::new_allocator<std::_Rb_tree_node<std::string> >::destroy<std::string>(std::string*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:140
 (inlined by) void std::allocator_traits<std::allocator<std::_Rb_tree_node<std::string> > >::destroy<std::string>(std::allocator<std::_Rb_tree_node<std::string> >&, std::string*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/alloc_traits.h:487
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_destroy_node(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:661
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_drop_node(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:669
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::_M_erase(std::_Rb_tree_node<std::string>*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:1874
 (inlined by) std::_Rb_tree<std::string, std::string, std::_Identity<std::string>, std::less<std::string>, std::allocator<std::string> >::~_Rb_tree() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:965
 (inlined by) std::set<std::string, std::less<std::string>, std::allocator<std::string> >::~set() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_set.h:281
 (inlined by) DDTeamCollection::isValidLocality(Reference<IReplicationPolicy>, LocalityData const&) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3696
Reference<IReplicationPolicy>::~Reference() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/include/flow/FastRef.h:125 (discriminator 1)
 (inlined by) DDTeamCollection::addBestMachineTeams(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4068 (discriminator 1)
DDTeamCollection::addTeamsBestOf(int, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4479
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont2(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:599
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1break1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3618
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1loopBody1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3605 (discriminator 5)
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1loopHead1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3574
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3365
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1when1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3380
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_callback_fire(ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>*, Void const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3401
 (inlined by) ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>::fire(Void const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/include/flow/flow.h:1313
void SAV<Void>::send<Void>(Void&&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/include/flow/flow.h:654
Promise<Void>::~Promise() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/include/flow/flow.h:922
 (inlined by) N2::Net2::PromiseTask::~PromiseTask() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:253
 (inlined by) N2::Net2::PromiseTask::operator()() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:260
 (inlined by) N2::Net2::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1492
main at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/fdbserver.actor.cpp:2310 (discriminator 4)
?? ??:0
_start at ??:?


What could be the reason?

Felix-zhoux avatar Jan 16 '23 12:01 Felix-zhoux

Looks like you have 18 undesired storage servers. Were they all excluded? If so, you may want to either include them back or add more storage servers.

DD found no healthy team in the system. If you make the changes suggested above, DD should be able to build healthy teams.
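One way to check the exclusion question is to look at the machine-readable status instead of the text summary. The following is a minimal sketch, assuming the output of `status json` has been saved to a file (the file name `status.json` and the helper name `excluded_storage_processes` are made up for illustration) and that each entry under `cluster.processes` carries `address`, `excluded` and `roles` fields:

```python
import json


def excluded_storage_processes(status):
    """Return the addresses of processes that hold a storage role and are marked excluded."""
    processes = status.get("cluster", {}).get("processes", {})
    excluded = []
    for proc in processes.values():
        # Collect the role names this process currently holds.
        roles = {r.get("role") for r in proc.get("roles", [])}
        if "storage" in roles and proc.get("excluded", False):
            excluded.append(proc.get("address", "unknown"))
    return sorted(excluded)


# Hypothetical usage, after saving the status with something like:
#   fdbcli --exec 'status json' > status.json
# with open("status.json") as f:
#     print(excluded_storage_processes(json.load(f)))
```

An empty list would suggest that exclusion is not why the storage servers are undesired, which points toward configuration instead.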

jzhou77 avatar Jan 16 '23 17:01 jzhou77

Looks like you have 18 undesired storage servers. Were they all excluded?

No, they are not excluded. When I run fdbtop process issue, it reports that all processes are healthy.

Felix-zhoux avatar Jan 17 '23 01:01 Felix-zhoux

Your configuration likely has a problem, e.g., in the locality settings or DC IDs. It's hard to tell without looking at them.
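To make the locality settings easier to inspect, they can be summarized from `status json`. This is a rough sketch, not a diagnosis: it assumes each process entry has a `locality` object with `dcid`, `zoneid` and `machineid` keys, and the helper name `summarize_storage_localities` is made up for illustration. Note that the status output in the original report shows 502 zones for 18 machines, so the zone count matches the process count rather than the machine count, which may be worth a look.

```python
from collections import Counter


def summarize_storage_localities(status):
    """Count storage processes per DC and the distinct zone and machine IDs they use."""
    processes = status.get("cluster", {}).get("processes", {})
    per_dc = Counter()
    zones, machines = set(), set()
    storage_count = 0
    for proc in processes.values():
        # Only storage-role processes matter for data distribution teams.
        if not any(r.get("role") == "storage" for r in proc.get("roles", [])):
            continue
        storage_count += 1
        loc = proc.get("locality", {})
        per_dc[loc.get("dcid", "<unset>")] += 1
        zones.add(loc.get("zoneid", "<unset>"))
        machines.add(loc.get("machineid", "<unset>"))
    return {
        "storage_processes": storage_count,
        "zones": len(zones),
        "machines": len(machines),
        "per_dc": dict(per_dc),
    }
```

With three_datacenter one would expect storage processes to appear under three distinct `dcid` values; a missing or misspelled DC ID would show up as an unexpected key in `per_dc`.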

jzhou77 avatar Jan 23 '23 17:01 jzhou77

The DD crash issue can be reproduced: https://forums.foundationdb.org/t/dd-crashed-when-storage-servers-exceed-1200/4144/3

Rjerk avatar Sep 15 '23 03:09 Rjerk