gluon-online-status concept

Open • mweinelt opened this issue 3 years ago • 23 comments

As we closed #1930 and #1684 today with a reference to an ongoing IRC discussion, I want to present the conclusion of that discussion.

We think that having all nodes ping into the world at regular intervals is not something we would like to see in a first implementation of this feature. Instead, we would like to focus on information that we can cheaply derive from the local node, and offer various packages a set of flags they can easily test to see whether the node is in a required state.

For that, we define a directory /var/gluon/online/ that holds empty marker files. For our initial proposal, we have two simple markers in mind for the first version:

  • neighbors or mesh to reflect that the node has neighbors that it meshes with
  • route_default4/6 or default_gw4/6, to reflect that the respective network stack has a default route
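A minimal sketch of how a consuming package might test these flags. It assumes the marker names mesh, default_gw4 and default_gw6 (one of the naming options listed above; the final names were still open), and the helpers has_marker and has_upstream as well as the MARKER_DIR override are hypothetical, added only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for packages that consume the proposed marker files.
# MARKER_DIR defaults to the directory proposed above and can be overridden.
MARKER_DIR="${MARKER_DIR:-/var/gluon/online}"

# has_marker NAME: succeed if the marker file NAME is present.
has_marker() {
    [ -e "$MARKER_DIR/$1" ]
}

# has_upstream: example policy (an assumption, not part of the proposal):
# the node meshes and has a default route for at least one address family.
has_upstream() {
    has_marker mesh && { has_marker default_gw4 || has_marker default_gw6; }
}
```

Since the markers are plain files, a test costs no more than a stat call, which is the point of deriving the state locally instead of pinging.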

In later versions, these could be extended by many other markers; I could imagine exposing whether we have an active NTP sync, for example. This would also be open to contributions of markers through community packages.

A script should run regularly, call the canonical checks of the mesh protocol's provider, and, based on their results, create the respective marker files if they do not exist yet, or delete them once they are no longer valid. I think it would be helpful not to touch already existing markers so that we can look at the mtime of these markers to know when they entered a certain state.
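One way the periodic script could maintain a single marker, sketched under the assumption that each check is just a command whose exit status says whether the state holds. update_marker and MARKER_DIR are hypothetical names. The marker is only created when missing, because touching an existing file would update its mtime (and its ctime) and lose the time the state was entered.

```shell
#!/bin/sh
# Hypothetical marker maintenance for the periodic script described above.
MARKER_DIR="${MARKER_DIR:-/var/gluon/online}"

# update_marker NAME CHECK...: run the command CHECK; if it succeeds, make sure
# the marker file NAME exists, otherwise remove it.
update_marker() {
    local name="$1"
    shift
    mkdir -p "$MARKER_DIR" || return 1
    if "$@" >/dev/null 2>&1; then
        # Create only when missing: an existing marker keeps the mtime of the
        # moment the node entered this state.
        [ -e "$MARKER_DIR/$name" ] || : > "$MARKER_DIR/$name"
    else
        rm -f "$MARKER_DIR/$name"
    fi
}
```

For example, assuming iproute2's ip is installed, the script could run: update_marker default_gw6 sh -c 'ip -6 route show default | grep -q .'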

For batman-adv, we think batctl n and batctl gwl are strong contenders, and for babeld something can be grepped from the output of the dump command on its control socket.
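A sketch of the two batman-adv checks, assuming batctl's -H option (suppress the table header) for n and gwl, so that any remaining output means at least one neighbor or gateway is known. Treating "non-empty output" as the criterion is an assumption, not a settled definition, and the gateway check only makes sense on a node in gateway client mode. BATCTL is a hypothetical override so the checks can be pointed at e.g. "batctl meshif bat0" or exercised without a mesh.

```shell
#!/bin/sh
# Sketch of the batman-adv checks. BATCTL is a hypothetical override; it is
# deliberately left unquoted at the call sites so it may hold several words,
# e.g. BATCTL="batctl meshif bat0".
BATCTL="${BATCTL:-batctl}"

# Neighbor table without header (-H): any line means at least one neighbor.
has_batman_neighbours() {
    [ -n "$($BATCTL n -H 2>/dev/null)" ]
}

# Gateway list without header (-H): any line means at least one gateway is known.
has_batman_gateway() {
    [ -n "$($BATCTL gwl -H 2>/dev/null)" ]
}
```

If batman-adv is not loaded, batctl writes its error to stderr, which is discarded, so both checks simply fail.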

N.B.: I'm not sure what state the Babel setup is in, and I wouldn't want to make it a mandatory part of the implementation if, as I have recently heard, it hasn't built for multiple releases now and nobody noticed.

@NeoRaider @blocktrron @AiyionPrime @T-X I hope this summarises what we talked about; if not, feel free to add your clarification below.

mweinelt • Jun 08 '21

I think it would be helpful not to touch already existing markers so that we can look at the mtime of these markers to know when they entered a certain state.

On second thought: we should have ctime to read that information, so touching could be done unconditionally.

mweinelt • Jun 08 '21

Suggestion:

/var/gluon/state/has_neighbours
/var/gluon/state/has_default_gw4
/var/gluon/state/has_default_gw6

(React with: 👍 or 👎)

lemoer • Jun 08 '21

Is it correct that batctl gwj and batctl nj won't be available in gluon/tree/master any time soon?

AiyionPrime • Jun 09 '21

So how will this solution help with creating an SSID changer? How can we be sure that the node can reach the internet?

rubo77 • Jun 09 '21

Is it correct that batctl gwj and batctl nj won't be available in gluon/tree/master any time soon?

As soon as we are on OpenWrt 21.02. But apparently we have not yet decided to migrate to it after 2021.1, which I honestly can't understand, because 19.07 has a projected EOL in August.

mweinelt • Jun 09 '21

So how will this solution help with creating an SSID changer? How can we be sure that the node can reach the internet?

@rubo77 Not sure it will. For now, I am aiming for the three paths above; a resource check using ICMP and external servers might be added later, as soon as the current goal proves to be insufficient.

And I haven't forgotten that there was a huge discussion about what an online checker might check, nor that one of its results was that a ping check is expected to be most useful for the offline-ssid-changer.

It's just so controversial that I'd like to start with a feature set we can agree on. There are other use cases for which the cheaper checks provided might be enough. We can start with them and discuss the implementation of the expensive ones later.

AiyionPrime • Jun 09 '21

So how will this solution help with creating an SSID changer?

At the moment there is no way to let a batman gateway signal whether it has public IPv6/IPv4 connectivity available through it. As long as this is the case, the offline SSID package should still test by "pinging the public internet".

Adorfer • Jun 09 '21

So how will this solution help with creating an SSID changer?

At the moment there is no way to let a batman gateway signal whether it has public IPv6/IPv4 connectivity available through it. As long as this is the case, the offline SSID package should still test by "pinging the public internet".

Why not

# untested, but hope you get the gist
IFNAME=bat0; DEST=192.0.0.2  # DEST: the upstream host to probe
{ ping -c4 -I "$IFNAME" "$DEST" >/dev/null && batctl meshif "$IFNAME" gw server 100mbit/100mbit; } || batctl meshif "$IFNAME" gw off

in a cron job on your gateway? Then we have signaling based on the gateway mode, and you even have some sort of failover condition.

There are smarter solutions than having hundreds of nodes ping into the internet all day long.

mweinelt • Jun 09 '21

Hanover does this as well.

AiyionPrime • Jun 09 '21

Why not

This is what we do, in a slightly more complex form, which I consider just a workaround. Nevertheless, I assume that there are people arguing that "being a batman gateway should not be the indicator for offering internet peering". Or, to put it a different way: I wish there were a dedicated way for an offline-ssid package to determine whether internet connectivity is available via the batman network, for example via DNS (a DNSBL-like method).

Adorfer • Jun 09 '21

What goal do you want to achieve with this gluon-online-status? Or rather, what do you want to collect this information for?

I think gluon-online-status is a bad name, as it suggests that it can actually check whether the internet is reachable, which only a global ping check could, and it does not make clear that this package also does neighbor checks etc.

My idea in https://github.com/freifunk-gluon/community-packages/pull/9 is that it is up to the community to decide at which level they want to do the test. It is possible to define target groups, e.g. local ones (supernodes, nameservers, timeservers) and global ones (publicly pingable servers, servers of other Freifunk communities or similar). The offline-ssid package can then be set to use either the global or the local targets, depending on whether the community thinks it is a bad idea to ping global targets.
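A sketch of the target-group idea, assuming each group is a whitespace-separated list of hosts and that a single answered echo request is enough to call a group reachable. The addresses are placeholders from the 2001:db8::/32 documentation prefix, and all variable and function names are hypothetical; none of them are taken from the linked pull request.

```shell
#!/bin/sh
# Hypothetical local/global target groups; a community would configure its own.
TARGETS_LOCAL="${TARGETS_LOCAL:-2001:db8::53 2001:db8::123}"
TARGETS_GLOBAL="${TARGETS_GLOBAL:-2001:db8:ffff::1}"
PING="${PING:-ping}"

# any_reachable TARGET...: succeed as soon as one target answers a single ping.
any_reachable() {
    for target in "$@"; do
        if $PING -c 1 -W 2 "$target" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# check_group local|global: ping the configured group; 2 means unknown group.
check_group() {
    case "$1" in
        local)  set -- $TARGETS_LOCAL ;;   # unquoted on purpose: split the list
        global) set -- $TARGETS_GLOBAL ;;
        *)      return 2 ;;
    esac
    any_reachable "$@"
}
```

Stopping at the first answering host keeps the number of probes per run low, which matters when every node in a mesh runs the check.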

My approach could clearly be made more efficient, but it was intended as an RFC right from the beginning. For example, one could do a local respondd query to determine whether another node in the local mesh cloud has already done a successful connection check, so that only one of the nodes pings the global targets.

I just don't understand why you want to add all this complexity. From my point of view, ping checks are not costly, and checking for neighbors and route_default4/6 would only add complexity without a proper use case, as a ping would not cost anything if these conditions are not met (because it would not reach the internet at all then).

CodeFetch • Jun 10 '21

The first of the three checks we initially agreed on is now part of master. The other two would (at least for batman-adv) follow in #2274.

Babel's implementation is still missing; if there's interest, I can write those checks as well.

AiyionPrime • Aug 11 '21

@AiyionPrime What about adding a respondd provider for publishing the gluon-state results?

CodeFetch • Aug 11 '21

Personally I do not see a need for that.

I was made aware of the goal to fit respondd responses into a single packet.

[...], the data should fit in a single unfragmented packet.[...]

Originally posted by @NeoRaider in https://github.com/freifunk-gluon/gluon/issues/2289#issuecomment-895578315

That's something I did not focus on yet, but will in the future. Other than that, such a provider is out of this issue's scope. If you need it, you might want to open another issue.

What's left to do here is #2274, as well as possibly a counterpart for Babel. Maybe @mweinelt has come to a conclusion; the initial post raises this question in the italic section.

AiyionPrime • Aug 11 '21

@AiyionPrime A respondd provider for this would be very useful for diagnostics in the future (maybe not for the currently implemented checks), and I think the overhead is negligible, as these are one-byte values... Isn't there a way to include the requested fields in the query? And if not... why? E.g. the total of transferred bytes or the hostname etc. doesn't need to be transferred every time, or only when it has changed. If respondd is at its size limit, respondd should be optimized...

CodeFetch • Aug 11 '21

I really think query-dependent respondd responses are out of scope here.

AiyionPrime • Aug 11 '21

I'm not sure which check for has_default_gw4 would make sense with Babel.

AiyionPrime • Aug 13 '21

@AiyionPrime I don't have access to a Babel-based mesh at the moment. Could you post the output of a babel dump?

neocturne • Aug 13 '21

Neither do I; I'll try to find a working net again.

AiyionPrime • Aug 15 '21

Hm, @mweinelt was, as far as I remember, under the impression that recent Gluon releases don't build, or at least don't work, with Babel, as no one has worked on or even tested the Babel support in the last ~2 years.

rotanid • Aug 17 '21

I haven't found a working net with recent gluon-babel in the past months, so this is kind of stalled by #2353.

AiyionPrime • Jan 14 '22

I think we need a solution here soon, because the current SSID changer doesn't work anymore with the latest Gluon and needs to be adapted.

rubo77 • Mar 10 '22

I think we need a solution here soon, because the current SSID changer doesn't work anymore with the latest Gluon and needs to be adapted.

I think the solution was implemented a while ago. For batman networks, the work has been done since the merge of #2245 and #2274. So if your community uses batman, you're already good to go.

If it doesn't and uses Babel instead, you only have an IPv6 default route check for now. You can, however, help get #2297 merged by testing how well it works.

The code was written about a year ago; we just haven't encountered anybody with a recent Babel network yet...

AiyionPrime • Mar 10 '22