redhawk icon indicating copy to clipboard operation
redhawk copied to clipboard

GPP Limits.cpp handling of file limit checks results in BUSY GPP (ULIMIT)

Open btgoodwin opened this issue 4 years ago • 1 comments

Summary: The GPP tracks 2 file limit states, one for the gpp process and one for the overall system. If the system limit parsing of /proc/sys/fs/file-nr fails to result in 3 columns, the system file limit maximum is set to 0. The GPP evaluates the system and gpp process' file limits together, the system one fails, the GPP goes into a permanent BUSY state, and the log message shows the GPP process limits as the reason (which indicates there is no problem).

In Limits.cpp, SysLimits::update_state, the following is seen:

  try{
    fname = "/proc/sys/fs/file-nr";
    std::ifstream file_nr(fname.c_str(), std::ifstream::in);
    if ( !file_nr.good()) throw std::ifstream::failure("unable to open " + fname );
    std::string line;
    while ( std::getline( file_nr, line ) ) {
      std::vector<std::string> values;
      boost::algorithm::split( values, line, boost::is_any_of(std::string(" \t")), boost::algorithm::token_compress_on ); // << SEE HERE (1)
      DEBUG(" values: " << values.size() << "  file-nr line: " << line  );
        
      if ( values.size() > 2 ) {
        try {
          tmp.files = boost::lexical_cast<int64_t>( values[0] );
          tmp.files_limit = boost::lexical_cast<int64_t>( values[2] ); // << SEE HERE (2)
        }
        catch( boost::bad_lexical_cast ){
        }
      }
    }
  }
  catch( ... ) {
  }

If the response from that procfs end point is malformed in some way (in this case, uses spaces instead of tabs causing breakage at 1), the parsing will result in 2 either pulling an incorrect value or being passed over entirely, leaving tmp.files_limit set to its default value, 0. This causes the GPP_i::_fileThresholdCheck to return TRUE 3:

    int gpp_max_open_files = gpp_limits.max_open_files * modified_thresholds.files_available;
    int sys_max_open_files = sys_limits.max_open_files * modified_thresholds.files_available;
    RH_TRACE(_baseLog, "Update file threshold monitor (GPP), threshold=" << gpp_max_open_files
             << " measured=" << gpp_limits.current_open_files);
    RH_TRACE(_baseLog, "Update file threshold monitor (system), threshold=" << sys_max_open_files
             << " measured=" << sys_limits.current_open_files);
    if (gpp_limits.current_open_files > gpp_max_open_files) {
        return true;
    } else if (sys_limits.current_open_files > sys_max_open_files) {
        return true; // << SEE HERE 3
    }
    return false;

This rolls up to GPP_i::updateUsageState where the log message is output using the GPP's limits, not the sys limits that caused the state change:

  // Approximately line 1895
  else if (_fileThresholdMonitor->is_threshold_exceeded()) {
      std::ostringstream oss;
      oss << "Threshold: " << gpp_limits.max_open_files << " Actual: " << gpp_limits.current_open_files;
      _setBusyReason("ULIMIT (MAX_FILES)", oss.str());
  }

I see this as two things.

First, a code improvement of either better parsing of /proc/sys/fs/file-nr or additionally utilizing /proc/sys/fs/file-max to establish the file limit would help make this code a bit more resilient to whitespace differences. Second, a bug exists where the log message is not correctly indicating why the GPP became unusable (BUSY). In this case, it was outputting that the threshold as 524288 and the actual value was 14 (i.e., it's saying 524288 < 14 == True to the user), which is nonsensical.

btgoodwin avatar May 28 '20 11:05 btgoodwin

There is also an issue where the tmp.files_limit is int64 and sys_limits.max_open_files is CORBA::LONG (32 bits). The casting is unsafe curently semit protected by a mistaken check for -1 on master.

meeksjd avatar Apr 25 '23 15:04 meeksjd