WeeWX-MQTTSubscribe icon indicating copy to clipboard operation
WeeWX-MQTTSubscribe copied to clipboard

MQTT problems cause WeeWX consuming all the CPU

Open roe-dl opened this issue 2 years ago • 13 comments

Observations:

  • The WeeWX process consumes more than 95 percent of the CPU.
  • The syslog is full of complaints of the MQTTSubscribe extension as shown below.
  • Restarting WeeWX or the whole computer does not help.
  • The computer becomes very slow.
  • This always happens when the MQTT broker is overloaded and responses delayed or not in order.
Mar  2 19:01:31 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779978.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:31 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779980.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779983.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779985.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779988.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779990.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779993.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779995.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677779998.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780000.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:32 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780003.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:33 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780005.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:34 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780008.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:36 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780010.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:36 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780013.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:37 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780015.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:38 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780018.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:38 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780020.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:38 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780023.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780025.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780028.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780030.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780033.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780035.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780038.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780040.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780043.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780045.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780048.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780050.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780053.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:39 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780055.000000 which is prior to previous packet 1677780068.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780057.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780059.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780062.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780064.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780067.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780069.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780072.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780074.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780077.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:01:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780079.000000 which is prior to previous packet 1677780101.000000
Mar  2 19:02:07 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780082.000000 which is prior to previous packet 1677780117.000000
Mar  2 19:02:08 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780084.000000 which is prior to previous packet 1677780117.000000
Mar  2 19:02:10 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780087.000000 which is prior to previous packet 1677780117.000000
Mar  2 19:02:42 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780089.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:43 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780092.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780094.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780097.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780099.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780102.000000 which is prior to previous packet 1677780153.000000
Mar  2 19:02:44 LokalWiki weewx[219308] ERROR user.MQTTSubscribe: (Service) Ignoring packet has dateTime of 1677780104.000000 which is prior to previous packet 1677780153.000000

roe-dl avatar Mar 02 '23 18:03 roe-dl

You could try setting ignore_start_time = True. This will result in MQTTSubscribe processing the incoming data that has a time ‘prior to previous packet’. Some additional information can found here, https://github.com/bellrichm/WeeWX-MQTTSubscribe/wiki/Date-Time-processing#mqttsubscribeservice-processing Because WeeWX/MQTTSubscribe id being ‘flooded’ with multiple incoming messages per second, I’m not sure this will solve the performance issue.

bellrichm avatar Mar 03 '23 14:03 bellrichm

Yes, I could set ignore_start_time = True, but that is not what I meant. Indeed, in this case, the base thing to do is to fix the overload situation of the MQTT broker. I did so. So no problem at my side any more.

What I meant is this: The MQTTSubscribe service should be able to handle situations when packets come in irregularly because of external conditions.

roe-dl avatar Mar 03 '23 17:03 roe-dl

Sorry, I’m not following what you are requesting. Setting the various time configuration settings is for the exact situation where the time in the MQTT payload ‘cannot be trusted’.

bellrichm avatar Mar 03 '23 19:03 bellrichm

This issue has been marked stale because it has been waiting for more information for 4 weeks.

github-actions[bot] avatar Apr 03 '23 00:04 github-actions[bot]

The time in the MQTT payload can be trusted. Because of some temporary overload situation the packets did not come in order any more or they came delayed. Such temporary situation should not result in an unrecoverable error condition.

roe-dl avatar Apr 04 '23 10:04 roe-dl

The error message is because the driver is emitting loop packets out of order. What driver are you using?

I agree its not good for WeeWX to consume the CPU, as of now I don’t know of a way for MQTTSubscribe to safe guard this. I’ll keep this open for a bit and attempt to to recreate the condition.

bellrichm avatar Apr 04 '23 17:04 bellrichm

Current plan. If the driver emits X ‘stale’ packets in a row OR Y total ‘stale’ packets, MQTTSubscribe will shut itself down. X and Y will be configurable.

bellrichm avatar Apr 04 '23 20:04 bellrichm

Below I have reformatted the log snippet to make what is happening more apparent. Using the log time of 19:01:39 as an example. In this one second WeeWX has dispatched 13 new_loop_packet events. Each of these events has been raised because the driver has generated a loop packet. Each of these loop packet have a time difference of 2 to 3 seconds. The times are all in the past. That is a lot of CPU cycles being used outside of MQTTSubscribe. Because the loop packet is ‘in the past’, the main thread of MQTTSubscribe logs it and ignores it. This would add some additional CPU processing, but minimal. There are no errors in the snippet with errors dealing with the incoming MQTT messages. So it appears that this thread is processing as normal (yes this a guess based on the information available). And unless the broker is flooding the topic with messages, it would not be consuming an unusual amount of CPU.

I realized that I can’t gracefully shutdown MQTTSubscribe. All that MQTTSubscribe can do is raise an exception which will shutdown WeeWX. I’ll think a bit more, but right now I don’t think there is much MQTTSubscribe can do.

        Log time        Packet time		        Previous packet time		
	Mar  2 19:01:31	1677779978	18:59:38	1677780068	19:01:08	
	Mar  2 19:01:31	1677779980	18:59:40	1677780068		
	Mar  2 19:01:32	1677779983	18:59:43	1677780068		
	Mar  2 19:01:32	1677779985	18:59:45	1677780068		
	Mar  2 19:01:32	1677779988	18:59:48	1677780068		
	Mar  2 19:01:32	1677779990	18:59:50	1677780068		
	Mar  2 19:01:32	1677779993	18:59:53	1677780068		
	Mar  2 19:01:32	1677779995	18:59:55	1677780068		
	Mar  2 19:01:32	1677779998	18:59:58	1677780068		
	Mar  2 19:01:32	1677780000	19:00:00	1677780068		
	Mar  2 19:01:32	1677780003	19:00:03	1677780068		
	Mar  2 19:01:33	1677780005	19:00:05	1677780068		
	Mar  2 19:01:34	1677780008	19:00:08	1677780068		
	Mar  2 19:01:36	1677780010	19:00:10	1677780068		
	Mar  2 19:01:36	1677780013	19:00:13	1677780068		
	Mar  2 19:01:37	1677780015	19:00:15	1677780068		
	Mar  2 19:01:38	1677780018	19:00:18	1677780068		
	Mar  2 19:01:38	1677780020	19:00:20	1677780068		
	Mar  2 19:01:38	1677780023	19:00:23	1677780068		
	Mar  2 19:01:39	1677780025	19:00:25	1677780068		
	Mar  2 19:01:39	1677780028	19:00:28	1677780068		
	Mar  2 19:01:39	1677780030	19:00:30	1677780068		
	Mar  2 19:01:39	1677780033	19:00:33	1677780068		
	Mar  2 19:01:39	1677780035	19:00:35	1677780068		
	Mar  2 19:01:39	1677780038	19:00:38	1677780068		
	Mar  2 19:01:39	1677780040	19:00:40	1677780068		
	Mar  2 19:01:39	1677780043	19:00:43	1677780068		
	Mar  2 19:01:39	1677780045	19:00:45	1677780068		
	Mar  2 19:01:39	1677780048	19:00:48	1677780068		
	Mar  2 19:01:39	1677780050	19:00:50	1677780068		
	Mar  2 19:01:39	1677780053	19:00:53	1677780068		
	Mar  2 19:01:39	1677780055	19:00:55	1677780068	19:01:08	
	Mar  2 19:01:43	1677780057	19:00:57	1677780101	19:01:41	
	Mar  2 19:01:43	1677780059	19:00:59	1677780101		
	Mar  2 19:01:43	1677780062	19:01:02	1677780101		
	Mar  2 19:01:43	1677780064	19:01:04	1677780101		
	Mar  2 19:01:43	1677780067	19:01:07	1677780101		
	Mar  2 19:01:43	1677780069	19:01:09	1677780101		
	Mar  2 19:01:43	1677780072	19:01:12	1677780101		
	Mar  2 19:01:44	1677780074	19:01:14	1677780101		
	Mar  2 19:01:44	1677780077	19:01:17	1677780101		
	Mar  2 19:01:44	1677780079	19:01:19	1677780101	19:01:41	
	Mar  2 19:02:07	1677780082	19:01:22	1677780117	19:01:57	
	Mar  2 19:02:08	1677780084	19:01:24	1677780117		
	Mar  2 19:02:10	1677780087	19:01:27	1677780117	19:01:57	
	Mar  2 19:02:42	1677780089	19:01:29	1677780153	19:02:33	
	Mar  2 19:02:43	1677780092	19:01:32	1677780153		
	Mar  2 19:02:44	1677780094	19:01:34	1677780153		
	Mar  2 19:02:44	1677780097	19:01:37	1677780153		
	Mar  2 19:02:44	1677780099	19:01:39	1677780153		
	Mar  2 19:02:44	1677780102	19:01:42	1677780153		
	Mar  2 19:02:44	1677780104	19:01:44	1677780153

bellrichm avatar Apr 06 '23 19:04 bellrichm

What driver are you using? What did you do to workaround the problem?

bellrichm avatar Apr 06 '23 20:04 bellrichm

I use the driver WeatherLinkLiveUDP. It receives LOOP messages every 2 to 3 seconds. I have a WeatherLinkLive 6100 device from Davis.

There was no workaround, there was a solution. The base reason was a problem of the RAM of the computer. It resulted in too small memory available. Changing the RAM module solved it.

The problem of too much log messages of the same kind I had myself, when I wrote some code. In that case I set a flag after creating the log message that prevents sending the same log message for some time. Something like this:

    if self.last_log_time<(time.time()-300):
        logerr(...)
        self.last_log_time = time.time()

So the error is logged at most once every archive interval. You could add a counter to log how much occurrences of the error condition took place since the last log message.

roe-dl avatar Apr 07 '23 10:04 roe-dl

Thanks for the information. I think throttling the error messages is probably the way to go, but I am still thinking about shutting MQTTSubscribe down in these situations.

bellrichm avatar Apr 07 '23 20:04 bellrichm

I think the driver/service should be able to recover if the error condition vanishes. Shutting it down would prevent this.

If the driver/service is shut down in case of error, this would require the user to monitor such events manually to do something about it.

roe-dl avatar Apr 11 '23 11:04 roe-dl

I think the driver/service should be able to recover if the error condition vanishes. Shutting it down would prevent this.

If the driver/service is shut down in case of error, this would require the user to monitor such events manually to do something about it.

Makes sense. I will look into some form of throttling the number of messages. Thanks for your thoughts. rich

bellrichm avatar Apr 16 '23 16:04 bellrichm