ros_comm icon indicating copy to clipboard operation
ros_comm copied to clipboard

Large delays when communicating with rosmaster on noetic running in docker container on an ARM64 jetson TX2

Open Gugfann opened this issue 3 years ago • 3 comments

Description

We have noticed some problems with setting up topics between nodes on our noetic system that is running on a Jetson TX2. The issue is that it can take a very long time for publishers to be registered with subscribers in certain nodes which results in delays of up to 1 minute when starting the system. During this time, the CPU load for the nodes in question increases to 100% for a single core. We have only started seeing this issue since we recently upgraded to noetic from kinetic (which also meant an OS update). It has also only been possible to recreate on an ARM64 Jetson TX2 running a Balena operating system that hosts a Docker container with the ROS system. The setup is slightly exotic since we built noetic ourselves to run in Ubuntu 18.04, so it is very possible that this is the main problem. But it would be great to hear your thoughts on what could cause a problem like this, so we could at least take that information on to Balena or otherwise. We have not been able to recreate it on a Jetson TX2 that runs anUbuntu 18.04 desktop OS so it is likely caused by the operating system, the container environment, or a combination of the two.

Setup

  • ARM64 Jetson TX2
  • BalenaOS
  • Ubuntu 18.04 Docker container
  • ROS noetic

The problem

When troubleshooting this problem, we have narrowed it down to the publisherUpdate that can sometimes take up to 30 seconds for a single topic. It is mostly isolated to nodes that publish and subscribe to topics in both directions. We noticed it because it could sometimes take almost a minute before certain callbacks were triggered even though a rostopic subscriber to the same topic from the terminal would get all the publications. This makes sense if the subscriber in the node has not connected to the publisher yet.

I have recreated the problem with two simple nodes that just publish and subscribe to topics between eachother:

#include <ros/ros.h>
#include <string>
#include <vector>
#include <std_msgs/Header.h>

template<typename T>
class Subscriber{
public:
    Subscriber(std::string topic_name, ros::NodeHandle &nh): nh_(nh) {
        sub_ = nh_.subscribe<T>(topic_name, 1, &Subscriber::callback, this);
    }
    ~Subscriber(){}

private:
    ros::NodeHandle nh_;
    ros::Subscriber sub_;

    void callback(const typename T::ConstPtr &msg){  }
};

int main(int argc, char* argv[]){
    ros::init(argc, argv, "nodeX", ros::init_options::NoRosout);
    ros::NodeHandle nh("nodeX");
    
    std::vector<Subscriber<std_msgs::Header>> subscriberList;
    std::vector<ros::Publisher> publisherList;
    int topicCount; nh.getParam("/topicCount", topicCount);

    for(int i = 0; i < topicCount; i++){
        publisherList.push_back(nh.advertise<std_msgs::Header>("/topic_X2Y_" + std::to_string(i), 1));
        subscriberList.push_back(Subscriber<std_msgs::Header>("/topic_Y2X_" + std::to_string(i), nh));
    }
        
    ros::spin();

    return 0;
}

When looking at the master.log for the session we can see delays in the publisherUpdate calls.

[rosmaster.master][INFO] 2021-05-26 14:09:12,170: publisherUpdate[/topic_X2Y_0] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=9.43, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:12,171: publisherUpdate[/topic_X2Y_1] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:17,103: publisherUpdate[/topic_X2Y_1] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.93, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:17,103: publisherUpdate[/topic_X2Y_2] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:22,039: publisherUpdate[/topic_X2Y_2] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.94, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:22,039: publisherUpdate[/topic_X2Y_3] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:26,989: publisherUpdate[/topic_X2Y_3] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.95, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:26,990: publisherUpdate[/topic_X2Y_4] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:31,929: publisherUpdate[/topic_X2Y_4] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.94, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:31,930: publisherUpdate[/topic_X2Y_5] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:34,407: publisherUpdate[/topic_X2Y_5] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=2.48, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:34,408: publisherUpdate[/topic_X2Y_6] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:39,308: publisherUpdate[/topic_X2Y_6] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.90, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:39,309: publisherUpdate[/topic_X2Y_7] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:44,232: publisherUpdate[/topic_X2Y_7] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.92, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:44,233: publisherUpdate[/topic_X2Y_8] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:49,148: publisherUpdate[/topic_X2Y_8] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.91, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:49,149: publisherUpdate[/topic_X2Y_9] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,042: publisherUpdate[/topic_X2Y_9] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=4.89, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,042: publisherUpdate[/topic_X2Y_10] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,047: publisherUpdate[/topic_X2Y_10] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,048: publisherUpdate[/topic_X2Y_11] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,052: publisherUpdate[/topic_X2Y_11] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,053: publisherUpdate[/topic_X2Y_12] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,058: publisherUpdate[/topic_X2Y_12] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,058: publisherUpdate[/topic_X2Y_13] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,063: publisherUpdate[/topic_X2Y_13] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,064: publisherUpdate[/topic_X2Y_14] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,068: publisherUpdate[/topic_X2Y_14] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,069: publisherUpdate[/topic_X2Y_15] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,072: publisherUpdate[/topic_X2Y_15] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,072: publisherUpdate[/topic_X2Y_16] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,074: publisherUpdate[/topic_X2Y_16] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,074: publisherUpdate[/topic_X2Y_17] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,075: publisherUpdate[/topic_X2Y_17] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,076: publisherUpdate[/topic_X2Y_18] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,077: publisherUpdate[/topic_X2Y_18] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2021-05-26 14:09:54,077: publisherUpdate[/topic_X2Y_19] -> http://5b282d4:50449/ ['http://5b282d4:49351/']
[rosmaster.master][INFO] 2021-05-26 14:09:54,079: publisherUpdate[/topic_X2Y_19] -> http://5b282d4:50449/ ['http://5b282d4:49351/']: sec=0.00, result=[1, '', 0]

Running the same nodes on other setups does not produce the same delays. It's at most 0.05s for a single topic. We have tried the following setups:

  • x86_64 ubuntu16.04 kinetic
  • x86_64 ubuntu18.04 noetic
  • x86_64 ubuntu18.04 noetic in docker container
  • aarch64 ubuntu18.04 noetic

rosnode ping

There also seems to be a lot of latency when you try to ping the nodes. This does not improve even when all the topics have been set up. While the ping is trying to contact a node the CPU also usage spikes as it did with the publisherUpdate.

$ rosnode ping -a

Will ping the following nodes: 
 * /nodeY
 * /nodeX
 * /rosout

pinging /nodeY with a timeout of 3.0s
xmlrpc reply from http://5b282d4:50921/ time=2457.330704ms
pinging /nodeX with a timeout of 3.0s
xmlrpc reply from http://5b282d4:49871/ time=2443.327188ms
pinging /rosout with a timeout of 3.0s
xmlrpc reply from http://5b282d4:49927/ time=2008.521795ms

Final thoughts

There seems to be a lot of latency when communicating with the rosmaster. I think it is probably due to some issue with the underlying network configuration in the BalenaOS container environment. But I would still be curious to hear if you had better insight into how the rosmaster XMLRC communication can be affected in this way, and maybe if there are things to do that can mitigate it. The CPU load increasing when trying to ping or set up new topic connections could maybe be interesting to you guys, since that could maybe point to an inefficient waiting loop somewhere or something like that.

Gugfann avatar May 27 '21 10:05 Gugfann

I think #2208 could have fixed this. There is a noetic release planned shortly. Can you test afterwards, or right away with the shadow-fixed repo?

peci1 avatar Jan 19 '22 17:01 peci1

I had a similar problem when running ROS nodes in docker. What helped was reducing the max file handles to 1024 again (as it is normally set without docker). You can set this with ulimit -n 1024 in the container or in docker-compose with

my-service:
   ...
    ulimits:
      nofile:
        soft: 1024
        hard: 1048576

flixr avatar Jan 19 '22 18:01 flixr

I had a similar problem when running ROS nodes in docker. What helped was reducing the max file handles to 1024 again (as it is normally set without docker). You can set this with ulimit -n 1024 in the container or in docker-compose with

my-service:
   ...
    ulimits:
      nofile:
        soft: 1024
        hard: 1048576

This is actually what I ended up doing to solve the problem. It seems to be somehow related to how the XMLRPC library deals with file descriptors inside docker containers. I found this #1927 issue, which details a very similar problem with the same proposed mitigation by limiting the number of allowed file descriptors. The actual fix that came from it does not work in our case since we have that already in our version of noetic. But changing the ulimit has been good enough for us for the time being.

I might give the proposed fix from #2208 a try out of curiosity, so thanks for the suggestion. I will let you know if I learn anything.

Gugfann avatar Jan 19 '22 20:01 Gugfann