redhawk
redhawk copied to clipboard
GPP Limits.cpp handling of file limit checks results in BUSY GPP (ULIMIT)
Summary: The GPP tracks 2 file limit states, one for the gpp process and one for the overall system. If the system limit parsing of /proc/sys/fs/file-nr
fails to result in 3 columns, the system file limit maximum is set to 0. The GPP evaluates the system and gpp process' file limits together, the system one fails, the GPP goes into a permanent BUSY state, and the log message shows the GPP process limits as the reason (which indicates there is no problem).
In Limits.cpp, SysLimits::update_state
, the following is seen:
try{
fname = "/proc/sys/fs/file-nr";
std::ifstream file_nr(fname.c_str(), std::ifstream::in);
if ( !file_nr.good()) throw std::ifstream::failure("unable to open " + fname );
std::string line;
while ( std::getline( file_nr, line ) ) {
std::vector<std::string> values;
boost::algorithm::split( values, line, boost::is_any_of(std::string(" \t")), boost::algorithm::token_compress_on ); // << SEE HERE (1)
DEBUG(" values: " << values.size() << " file-nr line: " << line );
if ( values.size() > 2 ) {
try {
tmp.files = boost::lexical_cast<int64_t>( values[0] );
tmp.files_limit = boost::lexical_cast<int64_t>( values[2] ); // << SEE HERE (2)
}
catch( boost::bad_lexical_cast ){
}
}
}
}
catch( ... ) {
}
If the response from that procfs end point is malformed in some way (in this case, uses spaces instead of tabs causing breakage at 1), the parsing will result in 2 either pulling an incorrect value or being passed over entirely, leaving tmp.files_limit set to its default value, 0. This causes the GPP_i::_fileThresholdCheck
to return TRUE 3:
int gpp_max_open_files = gpp_limits.max_open_files * modified_thresholds.files_available;
int sys_max_open_files = sys_limits.max_open_files * modified_thresholds.files_available;
RH_TRACE(_baseLog, "Update file threshold monitor (GPP), threshold=" << gpp_max_open_files
<< " measured=" << gpp_limits.current_open_files);
RH_TRACE(_baseLog, "Update file threshold monitor (system), threshold=" << sys_max_open_files
<< " measured=" << sys_limits.current_open_files);
if (gpp_limits.current_open_files > gpp_max_open_files) {
return true;
} else if (sys_limits.current_open_files > sys_max_open_files) {
return true; // << SEE HERE 3
}
return false;
This rolls up to GPP_i::updateUsageState
where the log message is output using the GPP's limits, not the sys limits that caused the state change:
// Approximately line 1895
else if (_fileThresholdMonitor->is_threshold_exceeded()) {
std::ostringstream oss;
oss << "Threshold: " << gpp_limits.max_open_files << " Actual: " << gpp_limits.current_open_files;
_setBusyReason("ULIMIT (MAX_FILES)", oss.str());
}
I see this as two things.
First, a code improvement of either better parsing of /proc/sys/fs/file-nr
or additionally utilizing /proc/sys/fs/file-max
to establish the file limit would help make this code a bit more resilient to whitespace differences. Second, a bug exists where the log message is not correctly indicating why the GPP became unusable (BUSY). In this case, it was outputting that the threshold as 524288 and the actual value was 14 (i.e., it's saying 524288 < 14 == True to the user), which is nonsensical.
There is also an issue where the tmp.files_limit
is int64 and sys_limits.max_open_files
is CORBA::LONG (32 bits). The casting is unsafe curently semit protected by a mistaken check for -1 on master.