An error occurred while starting the data node
Hi, I've been trying out NDB Cluster recently. I compiled a debug build and tried to launch a cluster on my machine, but the data node fails to start. Here is the relevant information:
config.ini:
[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2 # Number of fragment replicas
DataMemory=98M # How much memory to allocate for data storage
[ndb_mgmd]
# Management process options:
HostName=127.0.0.1 # Hostname or IP address of management node
DataDir=/root/runtime/ndb_mgmd/data # Directory for management node log files
[ndbd]
# Options for data node "A":
# (one [ndbd] section per data node)
HostName=127.0.0.1 # Hostname or IP address
NodeId=2 # Node ID for this data node
DataDir=/root/runtime/ndbd_1/data # Directory for this data node's data files
[ndbd]
# Options for data node "B":
HostName=127.0.0.1 # Hostname or IP address
NodeId=3 # Node ID for this data node
DataDir=/root/runtime/ndbd_2/data # Directory for this data node's data files
[mysqld]
# SQL node options:
HostName=127.0.0.1 # Hostname or IP address
# (additional mysqld connections can be
# specified for this node for various
# purposes such as running ndb_restore)
my.cnf:
[mysqld]
# Options for mysqld process:
ndbcluster # run NDB storage engine
[mysql_cluster]
# Options for NDB Cluster processes:
ndb-connectstring=127.0.0.1 # location of management server
Data node error log:
2023-07-10 15:24:05 [ndbd] INFO -- Angel pid: 26851 started child: 26852
2023-07-10 15:24:05 [ndbd] INFO -- Wrote data node PID: 26852 into pidfile /root/runtime/ndbd_1/data/ndb_2.pid
2023-07-10 15:24:05 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
2023-07-10 15:24:05 [ndbd] INFO -- Configuration fetched from '127.0.0.1:1186', generation: 1
2023-07-10 15:24:05 [ndbd] INFO -- Changing directory to '/root/runtime/ndbd_1/data'
2023-07-10 15:24:05 [ndbd] INFO -- Activating node 1
2023-07-10 15:24:05 [ndbd] INFO -- Activating node 2
2023-07-10 15:24:05 [ndbd] INFO -- Activating node 3
2023-07-10 15:24:05 [ndbd] INFO -- Activating node 4
2023-07-10 15:24:05 [ndbd] INFO -- SchedulerSpinTimer = 0
2023-07-10 15:24:05 [ndbd] INFO -- AutomaticThreadConfig = 1, NumCPUs = 0
2023-07-10 15:24:05 [ndbd] INFO -- Use automatic thread configuration
2023-07-10 15:24:05 [ndbd] INFO -- Auto thread config uses:
8 LDM threads,
8 Query threads,
8 tc threads,
16 Recover threads,
1 main threads,
1 rep threads,
4 recv threads,
2 send threads
2023-07-10 15:24:05 [ndbd] INFO -- Number of RR Groups = 1
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
2023-07-10 15:24:05 [ndbd] INFO -- MaxNoOfTriggers set to 200000
2023-07-10 15:24:05 [ndbd] INFO -- Automatic Memory Configuration start
2023-07-10 15:24:05 [ndbd] INFO -- SchemaMemory is 587 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- TransactionMemory is 300 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Redo log buffer size total are 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Undo log buffer is 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- LongMessageBuffer is 51539607572 MBytes <------------------ HERE IS THE PROBLEM
2023-07-10 15:24:05 [ndbd] INFO -- Send buffer sizes are 24 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Job buffer sizes are 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Static overhead is 208 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- OS overhead is 2667 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Backup Page memory is 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Restore memory is 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Packed signal memory is 0 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- NDBFS memory is 32 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- SharedGlobalMemory is 700 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Total memory is 126736 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- Used memory is 51539612090 MBytes
2023-07-10 15:24:05 [ndbd] INFO -- AutomaticMemoryConfig mode requires at least 512 MByte of space for DataMemory and DiskPageBufferMemory
2023-07-10 15:24:05 [ndbd] ALERT -- Not enough memory using automatic memory config, exiting, required 5050 MBytes
stack_bottom = 0 thread_stack 0x0
/tmp/build/bin/ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x8fc0ee]
/tmp/build/bin/ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2e) [0x856dfe]
/tmp/build/bin/ndbd(Configuration::setupConfiguration()+0x99d) [0x8763dd]
/tmp/build/bin/ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x28b) [0x4fa98b]
/tmp/build/bin/ndbd(real_main(int, char**)+0x513) [0x4f90c3]
/tmp/build/bin/ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x10b2) [0x4f8af2]
/tmp/build/bin/ndbd(real_main(int, char**)+0x434) [0x4f8fe4]
/tmp/build/bin/ndbd(main+0x3a) [0x4f51da]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fc6cc308555]
/tmp/build/bin/ndbd() [0x4f6860]
2023-07-10 15:24:05 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occurred during startphase 0. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.
The data node failed to start because LongMessageBuffer was calculated as a huge value, far larger than the total memory. Reviewing the related code, I found the issue in get_and_set_long_message_buffer(). It calls get_num_threads(), which sums thread-count fields from the globalData object. Those fields are still 0 when setupConfiguration() runs with globalData.isNdbMt == false, because the loop that sets them exits early. get_num_threads() therefore returns 0, and num_threads - 1 wraps around to 4294967295. The resulting default is 32 + 4294967295 * 12 = 51539607572 MBytes, which is exactly the value shown in the log.
void
Configuration::setupConfiguration(){
......
/**
* This is parts of get_multithreaded_config
*/
do
{
globalData.isNdbMt = NdbIsMultiThreaded();
g_eventLogger->info("DEBUG globalData.isNdbMt: %u", globalData.isNdbMt);
if (!globalData.isNdbMt) <----------------BREAK HERE, SO globalData IS NOT INITIALIZED
break;
......
} while(0);
......
if (automatic_memory_config)
{
if (!calculate_automatic_memory(it_p)) <----------------------------FAIL HERE
{
ERROR_SET(fatal, NDBD_EXIT_INVALID_CONFIG,
"Invalid configuration fetched",
"Could not handle automatic memory config");
DBUG_VOID_RETURN;
}
}
......
}
ROOT CAUSE:
Uint32
Configuration::get_num_threads()
{
Uint32 num_ldm_threads = globalData.ndbMtLqhThreads; <---------------0
Uint32 num_tc_threads = globalData.ndbMtTcThreads;
Uint32 num_query_threads = globalData.ndbMtQueryThreads;
Uint32 num_main_threads = globalData.ndbMtMainThreads;
Uint32 num_recv_threads = globalData.ndbMtReceiveThreads;
return num_ldm_threads +
num_tc_threads +
num_query_threads +
num_main_threads +
num_recv_threads;
}
Uint64
Configuration::get_and_set_long_message_buffer(
const ndb_mgm_configuration_iterator *p)
{
Uint32 long_signal_buffer = 0;
ndb_mgm_get_int_parameter(p, CFG_DB_LONG_SIGNAL_BUFFER, &long_signal_buffer);
Uint64 long_signal_buffer64 = Uint64(long_signal_buffer);
if (long_signal_buffer64 == 0)
{
Uint32 num_threads = get_num_threads(); <---------------------RETURNS 0
g_eventLogger->info("DEBUG num_threads: %u", num_threads);
long_signal_buffer64 = (Uint64(32) * MBYTE64);
long_signal_buffer64 += (Uint64(num_threads - 1) * Uint64(12) * MBYTE64); <----0 - 1 WRAPS IN Uint32 TO 4294967295 BEFORE WIDENING TO Uint64
}
globalData.theLongSignalMemory = long_signal_buffer64;
return long_signal_buffer64;
}
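The arithmetic above can be reproduced in isolation. The following is a minimal sketch, not the RonDB source: the Uint32/Uint64 aliases and the MBYTE64 constant are stand-ins assumed so that it compiles on its own, and long_message_buffer_guarded() is a hypothetical variant that simply treats a thread count of 0 as a single thread.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the NDB typedefs and the MBYTE64 constant (assumed here so
// the example is self-contained).
using Uint32 = std::uint32_t;
using Uint64 = std::uint64_t;
constexpr Uint64 MBYTE64 = 1024 * 1024;

// Same arithmetic as the default branch of get_and_set_long_message_buffer().
// num_threads - 1 is evaluated in Uint32, so 0 - 1 wraps to 4294967295
// before the conversion to Uint64.
constexpr Uint64 long_message_buffer_as_written(Uint32 num_threads)
{
  return Uint64(32) * MBYTE64 +
         Uint64(num_threads - 1) * Uint64(12) * MBYTE64;
}

// Hypothetical guarded variant: never subtract from a zero thread count.
constexpr Uint64 long_message_buffer_guarded(Uint32 num_threads)
{
  return Uint64(32) * MBYTE64 +
         Uint64(num_threads > 0 ? num_threads - 1 : 0) * Uint64(12) * MBYTE64;
}

// With zero threads the original formula yields the value seen in the log.
static_assert(long_message_buffer_as_written(0) / MBYTE64 == 51539607572ULL,
              "wrapped value matches the logged LongMessageBuffer size");
// The guarded variant falls back to the 32 MByte base.
static_assert(long_message_buffer_guarded(0) / MBYTE64 == 32,
              "zero threads falls back to the base size");
```

Both functions agree for any non-zero thread count, so a guard of this kind would only change behaviour in the uninitialized case. Whether clamping or aborting is the right reaction is a design decision for the maintainers.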
Temporary workaround: setting LongMessageBuffer explicitly in config.ini avoids this problem, but I think it would be better to fix it in the code :)
I gather that you started the data node with the ndbd binary. ndbd has been removed from the RonDB binary tarball, but it seems it was not properly removed from the build process. The proper solution is to ensure that ndbd isn't even built anymore. The short-term solution is to abort if isNdbMt is false and print a message saying that ndbd is deprecated and that ndbmtd should be used instead.
Thx for the catch and the analysis.
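The short-term check described in the reply could look roughly like the sketch below. It is illustrative only: require_multithreaded() is a hypothetical helper, not an existing RonDB function, and the message text is an assumption. In the real code the check would more likely sit right after globalData.isNdbMt is set in setupConfiguration() and report through the existing error macros.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: returns true when startup may continue, and prints a
// deprecation message when the single-threaded ndbd binary is in use.
bool require_multithreaded(bool is_ndb_mt)
{
  if (is_ndb_mt)
  {
    return true;
  }
  std::fprintf(stderr,
               "ndbd is deprecated, start the data node with ndbmtd instead\n");
  return false;
}
```

A caller would abort startup when this returns false, before any automatic memory calculation runs with zeroed thread counts.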