
Resending packets collected while MQTT was down

Open Enemoino opened this issue 3 years ago • 9 comments

From what I understand, uplinks are stored in mnesia before the handler parses them into a connector. Is there a way for the handler to send the uplinks that were not sent while MQTT was down for any given reason? Trying to figure it out, I presume that mnesia holds up to 50 messages per single node; this parameter is configurable in the server config. I am aware that the server would have to remember some connection parameters, like the last date and time, and would have to parse all mnesia records from that time... but I am starting to wander away from the issue. After a lost/dropped connection, the only way to retrieve the missing uplinks is by the REST API.

  1. Can I fetch them by MQTT?
  2. Can I resend them when a new connection occurs? Can you give me some hints on whether this is possible to configure, and if not, where to look for those events so I can add some code for it?

Enemoino avatar May 18 '21 10:05 Enemoino

Hello! Given some consideration, this is not at all a simple matter and may not be possible at all if you look at things globally.

The rxframes table stores only the raw information received from the gateway, with a few additional sorting keys, and is missing a few important fields inherited from the device's profile that are required by some applications (appargs, appid, etc.). You can't access rxframes other than via the REST API.
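To illustrate that REST-only access path, a client that notices a gap could simply poll the stored frames over HTTP. A minimal Erlang sketch follows, assuming the admin REST interface on port 8080 exposes the collection at /rxframes and accepts basic auth with the default admin credentials; both the path and the credentials are assumptions here, so check the administration documentation for the exact resource name and authentication scheme.

```erlang
%% Minimal sketch: fetch stored frames over the admin REST interface.
%% The /rxframes path and the admin:admin credentials are assumptions;
%% check the administration documentation for the exact resource and auth.
fetch_rxframes(Host) ->
    {ok, _} = application:ensure_all_started(inets),
    Auth = "Basic " ++ base64:encode_to_string("admin:admin"),
    Url  = "http://" ++ Host ++ ":8080/rxframes",
    case httpc:request(get, {Url, [{"authorization", Auth}]}, [], []) of
        {ok, {{_, 200, _}, _Headers, Body}} ->
            {ok, Body};                          %% JSON-encoded list of frames
        {ok, {{_, Code, _}, _Headers, _Body}} ->
            {error, {http_status, Code}};
        {error, Reason} ->
            {error, Reason}
    end.
```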

By design, uplink processing is Handler-centric. One received uplink may be sent to multiple different connectors via a single Handler. You can't maintain a queue on a per-Handler basis - for that you would have to keep track of which uplink succeeded or failed on which connector, otherwise it would cause massive data duplication on the active, successful connectors at the price of delivery to the one that failed. That means you would have to keep a separate queue for every single connector, holding frozen, processed uplinks with all the fields imposed by a Handler attached, plus lots of mechanics for queue maintenance with policies like persist, override, data shifting on overflow, etc.

Ideologically, and based on the core LoRaWAN protocol's assumptions about data delivery reliability, the simplest solution would be to build the application around the fact that uplink data may be lost, and just periodically attempt to restart failed connectors. The retry should be "cooled down" a bit, or it may cause serious DoS issues on a busy system with reasonable uplink traffic. Btw, you still have to deal with missing uplinks anyway - your gateway may fail, or the network link from the gateway may go down, and that will result in data loss at the LoRaWAN level even if you are using confirmed uplinks...

altishchenko avatar May 18 '21 16:05 altishchenko

Thank you for your response,

Well, I was not aware of this situation with rxframes in the database; I have seen them in the logs, but not directly in mnesia.

OK, so let's consider a different approach, something like a remote client searching through mnesia records via MQTT. Handlers have direct access to mnesia. Is it possible to add a separate handler that would allow me to publish, for instance, {Gateway MAC}/{Configuration topic}/{Configuration parameters} and write them to mnesia for new node configuration and such? It seems like a more natural way, because handlers interact directly with mnesia for frame parsing. Other idea: write a parallel app that will subscribe to such a configuration topic and write it to lorawan-server via the REST API.

Yes, I agree about the core purpose of LoRaWAN and its reliability, but we have run into a problem with connectors raising fault flags with "badarg" when the system starts and there is no way to physically connect to the MQTT server. It made reporting super unreliable, because if the gateway loses connection with the server, then the next time an rxframe is ready to push to MQTT, badarg occurs and the handler permanently stops reporting frames. I understand how it works in the handler code, but I don't get why a single failed connection is considered badarg and not a network problem. Maybe you can tell me what is happening. Thank you

Enemoino avatar May 19 '21 07:05 Enemoino

Hello, I don't think I understand correctly what you are trying to achieve with a separate direct connector. Maybe if you explain your real task there, I will be able to give you some other ideas. For example, there was a case where new, unknown nodes were automatically configured into the server as soon as they showed up in the air, making use of the server-global events handler - events were received by an app via mqtt/amqp and node provisioning was performed via REST. There are actually many options out of the box, I am just not getting the real task :) Or, you can have an Erlang app that monitors the mnesia rxframes table via the standard mnesia table-event subscription mechanism (no need to search) and does something as soon as a new record is written into it.
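A minimal sketch of that monitoring idea, assuming the table is named rxframes and that plain mnesia table events are sufficient; the module and function names are illustrative only, not part of the server:

```erlang
%% Illustrative sketch only: subscribe to mnesia table events for rxframes
%% and react whenever a new record is written. The module and table names
%% are assumptions based on the discussion above.
-module(rxframe_watcher).
-export([start/0]).

start() ->
    Pid = spawn(fun init/0),
    {ok, Pid}.

init() ->
    %% the calling process will receive {mnesia_table_event, ...} messages
    {ok, _Node} = mnesia:subscribe({table, rxframes, simple}),
    loop().

loop() ->
    receive
        {mnesia_table_event, {write, NewRecord, _ActivityId}} ->
            %% a frame was just stored: forward it, buffer it, etc.
            io:format("new rxframe: ~p~n", [NewRecord]),
            loop();
        {mnesia_table_event, _Other} ->
            loop()
    end.
```

The subscribing process then gets a message for every write, so there is no need to scan the table for new records.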

As for the connector restart, I understand the problem and will look into it later on. I think I will just create a separate entity in the supervision tree that will attempt failed connector restarts at reasonable intervals. Will that suffice?
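Purely as a sketch of that idea (not the code that eventually went into the server), such an entity could be a small gen_server that wakes up on a timer; restart_failed_connectors/0 below is a hypothetical placeholder for whatever restart call the connector supervisor actually exposes:

```erlang
%% Sketch of a periodic retry process; restart_failed_connectors/0 is a
%% hypothetical placeholder, not an existing lorawan-server function.
-module(connector_retry).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(IntervalMs) ->
    gen_server:start_link(?MODULE, IntervalMs, []).

init(IntervalMs) ->
    erlang:send_after(IntervalMs, self(), retry),
    {ok, IntervalMs}.

handle_info(retry, IntervalMs) ->
    restart_failed_connectors(),
    erlang:send_after(IntervalMs, self(), retry),
    {noreply, IntervalMs};
handle_info(_Other, IntervalMs) ->
    {noreply, IntervalMs}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% Hypothetical placeholder: a real implementation would ask the connector
%% supervisor to restart the children that failed with a 'network' error.
restart_failed_connectors() ->
    ok.
```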

altishchenko avatar May 19 '21 09:05 altishchenko

> I understand how it works in the handler code, but I don't get why a single failed connection is considered badarg and not a network problem. Maybe you can tell me what is happening

Oh, and for this, can you give me a debug.log of the starting server where this happens? It must be a 'network' failure; if it is 'badarg', then something else is not right.

altishchenko avatar May 19 '21 09:05 altishchenko

> As for the connector restart, I understand the problem and will look into it later on. I think I will just create a separate entity in the supervision tree that will attempt failed connector restarts at reasonable intervals. Will that suffice?

That would be great, it's exactly what I had in mind. If this supervisor could have two options, like 1) restart if the connection failed and be ready for the next attempt, or 2) restart on a specific interval, that would make my life a lot easier. I was afraid I would have to learn some Erlang the good-but-hard way, deep diving into this mechanism.

> Oh, and for this, can you give me a debug.log of the starting server where this happens? It must be a 'network' failure; if it is 'badarg', then something else is not right.

I will restart them manually (they are in a remote location right now) and provide you with the log files.

> Hello, I don't think I understand correctly what you are trying to achieve with a separate direct connector. Maybe if you explain your real task there, I will be able to give you some other ideas. For example, there was a case where new, unknown nodes were automatically configured into the server as soon as they showed up in the air, making use of the server-global events handler - events were received by an app via mqtt/amqp and node provisioning was performed via REST. There are actually many options out of the box, I am just not getting the real task :) Or, you can have an Erlang app that monitors the mnesia rxframes table via the standard mnesia table-event subscription mechanism (no need to search) and does something as soon as a new record is written into it.

OK, so I will explain it the right way ;)

Problem -> after losing connection with the gateway, the MQTT server stops receiving messages, but we don't know whether the failure is in the node, the gateway, or the network, that being obvious. Our system relies only on MQTT (a tidy way to connect the gateway with the server, which remains a passive participant in this system). If we stop receiving, we would like to gather the missed frames for an algorithm that feeds on them, to ensure continuity of data on the algorithm side. The only way to do this would be to build a whole decision service that would recognize and fetch the missing frames. That would be complicated, as you mentioned earlier, and it would also have to use another technology, the REST API. Why not create a client on the server side -> with one gateway it doesn't matter, but when we have more of them, that whole decision-making process will eat up resources. But implementing a LIFO buffer on the gateway's MQTT client side would be a super simple way of increasing communication reliability.

Conclusion -> we are aware of missing uplinks and everything else that radio communication brings to the equation, but it would be great if messages would stack up on the MQTT client until it recovers its connection and can sort of burst them into the MQTT server. Only the input order would act as a sorting key, so every failed attempt to connect and publish to the topic would be resolved by stacking the data and unstacking it in a LIFO way.
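A minimal sketch of that store-and-forward behaviour, assuming the emqtt client library; the module name, the in-memory queue, and the simplified error handling are illustrative only - a real version would persist the buffer (e.g. in mnesia, as in idea #2 below) instead of keeping it in process memory:

```erlang
%% Sketch only: keep frames queued while the broker is unreachable and
%% replay them, oldest first, once publishing succeeds again.
%% Topics and payloads are expected to be binaries.
-module(buffered_publisher).
-export([start/1, send/3]).

start(BrokerOpts) ->
    Pid = spawn(fun() ->
        process_flag(trap_exit, true),      %% survive emqtt client exits
        loop(BrokerOpts, undefined, queue:new())
    end),
    {ok, Pid}.

send(Pid, Topic, Payload) ->
    Pid ! {frame, Topic, Payload},
    ok.

loop(Opts, Client, Buffer) ->
    receive
        {frame, Topic, Payload} ->
            Buffer1 = queue:in({Topic, Payload}, Buffer),
            case ensure_client(Opts, Client) of
                {ok, C} -> loop(Opts, C, drain(C, Buffer1));
                error   -> loop(Opts, undefined, Buffer1)
            end;
        {'EXIT', Client, _Reason} ->
            %% the current broker connection died; keep buffering
            loop(Opts, undefined, Buffer);
        {'EXIT', _StaleClient, _Reason} ->
            loop(Opts, Client, Buffer)
    end.

ensure_client(Opts, undefined) ->
    case emqtt:start_link(Opts) of
        {ok, C} ->
            case emqtt:connect(C) of
                {ok, _Props} -> {ok, C};
                {error, _}   -> error       %% cleanup of C omitted for brevity
            end;
        {error, _} ->
            error
    end;
ensure_client(_Opts, Client) ->
    {ok, Client}.

%% replay everything queued, oldest first; stop at the first failure
drain(Client, Buffer) ->
    case queue:out(Buffer) of
        {empty, _} ->
            Buffer;
        {{value, {Topic, Payload}}, Rest} ->
            case catch emqtt:publish(Client, Topic, Payload, 1) of
                ok           -> drain(Client, Rest);
                {ok, _PktId} -> drain(Client, Rest);
                _Failure     -> Buffer
            end
    end.
```

Draining oldest-first on reconnect preserves the input order, which is the only sorting key described above.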

My task #1 -> find a way to ensure that rx frames are consistent in our database.
My idea #1 -> create a piggyback algorithm that will subscribe to our MQTT server and "translate" REST API actions into MQTT (I don't like that).
My idea #2 -> add code to the Erlang MQTT client to store messages in mnesia, periodically check for a connection, and then dump the LIFO buffer.

My task #2 -> create a way to deliver the server configuration (all the parameters stored in mnesia) that is used for creating nodes and such - basically the REST API capabilities, but over MQTT, to only publish a configuration to the server side and bring functionality similar to the gateway autoconfig.
My idea #1 -> the same as idea #1 under task #1. Thank you

Enemoino avatar May 19 '21 10:05 Enemoino

@Enemoino Why don't you keep a mongodb on a server alongside your mqtt connector, storing all received uplinks there too? If your application detects information loss, it can query the mongo database for the missing frames and get them exactly as they would have been transmitted over mqtt, and you can even mimic topics with collections. This is much easier to implement than what you propose above, and there are plenty of mongodb client libraries out there to write any buffering-gateway application you need in any language you admire. You only need to install mongodb on your network server, create a second connector to your application and point it at mongodb. For historical reasons we keep mongodb connectors alongside AMQP connectors to support some python/mongo application code. And to make things even funnier, our typical node setup consists of a docker network with 4 containers: lorawan-server, rabbitmq, mongodb and prometheus.

This is how the server config looks: [screenshot: Снимок экрана от 2021-05-23 00-56-15]

BTW, I opened PR #772 for periodic restart attempts of connectors with 'network' failures. At the moment, it will restart only connectors with a single 'network' failure.

How is that debug.log coming along? And, in addition to it, what version of Erlang/OTP and what version of the server are you running?

altishchenko avatar May 22 '21 21:05 altishchenko

> Why don't you keep a mongodb on a server alongside your mqtt connector, storing all received uplinks there too?

Hi, thank you, yes, this is a very good idea, I was considering it; the only missing part is administrating the server (adding new nodes, etc.). I will do that. Thank you for that.

> BTW, I opened PR #772 for periodic restart attempts of connectors with 'network' failures. At the moment, it will restart only connectors with a single 'network' failure.

Thank you for that, but I don't get a 'network' failure, I get 'badarg'. I understand that we will get to it ;] and to why it should report 'network' and not 'badarg', once I send the log files.

About that log, I will provide it to you tomorrow. I decided to set up a "test" bench for the new code, and first I will try to provide that log for you. I am using 0.7.0 and Erlang/OTP 21.

Enemoino avatar May 24 '21 10:05 Enemoino

Hi, I have tested #772:

1. compiles
2. starts
3. :8080 reachable
4. server configurable, adding a connector
5. connector shows OK
6. after restart (system reboot) it stops responding

systemctl status:

```
● lorawan-server.service - LoRaWAN Server
   Loaded: loaded (/lib/systemd/system/lorawan-server.service; enabled; vendor p
   Active: active (running) since Fri 2021-05-28 09:04:27 BST; 4min 5s ago
 Main PID: 465 (lorawan-server)
    Tasks: 31 (limit: 2062)
   CGroup: /system.slice/lorawan-server.service
           ├─ 465 /bin/sh /usr/lib/lorawan-server/bin/lorawan-server
           ├─ 501 /usr/lib/erlang/erts-10.2.4/bin/beam.smp -Bd -- -root /usr/lib
           ├─ 529 erl_child_setup 1024
           ├─ 636 sh -s disksup
           ├─ 638 /usr/lib/erlang/lib/os_mon-2.4.7/priv/bin/memsup
           ├─ 639 /usr/lib/erlang/lib/os_mon-2.4.7/priv/bin/cpu_sup
           ├─1153 inet_gethost 4
           └─1154 inet_gethost 4

May 28 09:04:27 raspberrypi systemd[1]: Started LoRaWAN Server.
```

debug.log:

```
2021-05-28 09:04:57.053 [warning] <0.471.0>@lorawan_connector_mqtt:terminate:298 Connector NAME_DEDUCTED terminated: {{bad_client_id,[{emqtt,assign_id,2,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,1076}]},{emqtt,waiting_for_connack,3,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,682}]},{gen_statem,call_state_function,5,[{file,"gen_statem.erl"},{line,1660}]},{gen_statem,loop_event_state_function,6,[{file,"gen_statem.erl"},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_statem,call,[<0.472.0>,{connect,emqtt_sock},infinity]}}
2021-05-28 09:04:57.056 [error] <0.471.0> gen_server <0.471.0> terminated with reason: {{bad_client_id,[{emqtt,assign_id,2,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,1076}]},{emqtt,waiting_for_connack,3,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,682}]},{gen_statem,call_state_function,5,[{file,"gen_statem.erl"},{line,1660}]},{gen_statem,loop_event_state_function,6,[{file,"gen_statem.erl"},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_statem,call,[<0.472.0>,...]}} in gen:do_call/4 line 177
2021-05-28 09:04:57.058 [error] <0.471.0> CRASH REPORT Process <0.471.0> with 0 neighbours exited with reason: {{bad_client_id,[{emqtt,assign_id,2,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,1076}]},{emqtt,waiting_for_connack,3,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,682}]},{gen_statem,call_state_function,5,[{file,"gen_statem.erl"},{line,1660}]},{gen_statem,loop_event_state_function,6,[{file,"gen_statem.erl"},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_statem,call,[<0.472.0>,...]}} in gen:do_call/4 line 177
2021-05-28 09:04:57.064 [error] <0.464.0> Supervisor lorawan_connector_sup had child {mqtt,<<"NAME_DEDUCTED">>} started with lorawan_connector_mqtt:start_link({connector,<<"NAME_DEDUCTED">>,<<"Gateway-Cloud-Extended">>,<<"json">>,<<"mqtt://NAME_DEDUCTED:...">>,...}) at <0.471.0> exit with reason {{bad_client_id,[{emqtt,assign_id,2,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,1076}]},{emqtt,waiting_for_connack,3,[{file,"/home/pi/testdev/lorawan-server/_build/default/lib/emqtt/src/emqtt.erl"},{line,682}]},{gen_statem,call_state_function,5,[{file,"gen_statem.erl"},{line,1660}]},{gen_statem,loop_event_state_function,6,[{file,"gen_statem.erl"},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_statem,call,[<0.472.0>,...]}} in context child_terminated
2021-05-28 09:04:57.066 [error] <0.464.0> Supervisor lorawan_connector_sup had child {mqtt,<<"NAME_DEDUCTED">>} started with lorawan_connector_mqtt:start_link({connector,<<"NAME_DEDUCTED">>,<<"Gateway-Cloud-Extended">>,<<"json">>,<<"mqtt://NAME_DEDUCTED:...">>,...}) at <0.471.0> exit with reason reached_max_restart_intensity in context shutdown
2021-05-28 09:04:57.067 [error] <0.445.0> Supervisor {<0.445.0>,lorawan_backend_sup} had child connectors started with lorawan_connector_sup:start_link() at <0.464.0> exit with reason shutdown in context child_terminated
2021-05-28 09:04:57.068 [error] <0.445.0> Supervisor {<0.445.0>,lorawan_backend_sup} had child connectors started with lorawan_connector_sup:start_link() at <0.464.0> exit with reason reached_max_restart_intensity in context shutdown
2021-05-28 09:04:57.069 [error] <0.346.0> Supervisor {<0.346.0>,lorawan_sup} had child backends started with lorawan_backend_sup:start_link() at <0.445.0> exit with reason shutdown in context child_terminated
2021-05-28 09:04:57.070 [error] <0.346.0> Supervisor {<0.346.0>,lorawan_sup} had child backends started with lorawan_backend_sup:start_link() at <0.445.0> exit with reason reached_max_restart_intensity in context shutdown
2021-05-28 09:04:57.070 [info] <0.43.0> Application lorawan_server exited with reason: shutdown
```

The server stops responding in the browser.

After removing connector.DCD from mnesia and restarting the process, the server starts responding again.

Enemoino avatar May 28 '21 08:05 Enemoino

Now we are getting somewhere! The connector restart will not touch connectors with badarg, only a <<"network">> failure, so for you the code in the PR didn't change anything from the previous server version. Corruption of the mnesia database on a Raspberry Pi on system reboot seems to be an omnipresent feature; this comes from the way the Raspberry Pi (Raspbian) improperly performs the shutdown procedure...

Now to the good part! Thanks for the log, it helps! Show me your connector config please; a dbexport will be best. The problem is somewhere in the main mqtt config, I need to check it.

altishchenko avatar May 28 '21 11:05 altishchenko