[dev.icinga.com #10118] Evaluate a way of reducing the traffic between cluster nodes
This issue has been migrated from Redmine: https://dev.icinga.com/issues/10118
Created by teclogi on 2015-09-08 07:47:20 +00:00
Assignee: (none) Status: New Target Version: (none) Last Update: 2016-03-18 17:26:42 +00:00 (in Redmine)
Backport?: Not yet backported
Include in Changelog: 1
For environments with a mobile connection or satellite link (for example, ships) it would be very helpful to reduce the traffic between master and satellite. Currently Icinga uses about 750 MB/month just for keep-alive traffic between master and satellite, which can be very expensive on a connection with a traffic limit (for example a satellite link or mobile connection). Maybe there is a way to implement an on-demand connection between master and satellite.
So it would be helpful to have some options to reduce this traffic, such as:
- retry-interval
currently, on connection loss the satellite retries the connection to the master 3 times every 2 minutes!
- off-time
to schedule off times
- on-demand (true/false)
only connect to the master if required
Updated by mfriedrich on 2015-09-09 08:19:03 +00:00
- Subject changed from Reduce trafic between master and satellite () to Evaluate a way of reducing the traffic between cluster nodes
- Category set to Cluster
Updated by mfriedrich on 2016-03-18 17:26:42 +00:00
- Priority changed from Normal to Low
I know a way to reduce the traffic w/o breaking anything:
Send heartbeats every 60s, not every 10s.
Heartbeats don't account for much traffic in bytes; the most invasive parts come from large check result outputs and performance data. However, without heartbeats at that frequency, things like the replay log won't work reliably, and neither would the HA failover.
In terms of compression, we would need to take the following things into account:
- zlib as dependency
- New cluster protocol version, which would likely be incompatible with everything else
- Binary format for messages, which is even harder to debug over the wire
- Decompression and compression take more CPU resources than plain text strings
- Low-profile hardware may suffer degraded performance and late events due to the compression overhead
What about using Google's protocol buffers: https://developers.google.com/protocol-buffers?
That would mean moving away from JSON-RPC and replacing the entire cluster protocol. That's currently not an option in 2.x.
Will be considered for future major versions of Icinga, 3.x+.
Idea
- Every node includes in its hello message that it supports gzip
- Every Endpoint has a local state Endpoint#has_gzip
- Every node updates the peer's Endpoint#has_gzip based on its hello message
- On every outgoing cluster connection we check whether the peer supports gzip and, if yes, we wrap the NetString stream into a gzip stream
- On every incoming cluster connection, if the first byte is 0x1F, we do NetString I/O over gzip (sketched below)
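A minimal sketch of the detection half of this idea, assuming hypothetical helper names (PeerSpeaksGzip, GunzipMessage) rather than the actual icinga2 stream classes: peek at the first bytes of an incoming connection and, if they carry the gzip magic, inflate the payload with zlib before handing it to the NetString reader.

```cpp
// Sketch only: decide whether an incoming cluster connection speaks gzip.
// Hypothetical helpers, not the actual icinga2 ApiListener code. Link with -lz.
#include <zlib.h>
#include <cstdint>
#include <stdexcept>
#include <vector>

// gzip streams always start with the two magic bytes 0x1F 0x8B.
bool PeerSpeaksGzip(const std::vector<uint8_t>& firstBytes)
{
    return firstBytes.size() >= 2 && firstBytes[0] == 0x1F && firstBytes[1] == 0x8B;
}

// Inflate one gzip-wrapped blob into plain bytes (e.g. a NetString payload).
std::vector<uint8_t> GunzipMessage(const std::vector<uint8_t>& compressed)
{
    z_stream zs{};
    // windowBits = 15 + 16 tells zlib to expect a gzip header, not a raw zlib header.
    if (inflateInit2(&zs, 15 + 16) != Z_OK)
        throw std::runtime_error("inflateInit2 failed");

    std::vector<uint8_t> out;
    std::vector<uint8_t> buf(16 * 1024);

    zs.next_in = const_cast<Bytef*>(compressed.data());
    zs.avail_in = static_cast<uInt>(compressed.size());

    int ret;
    do {
        zs.next_out = buf.data();
        zs.avail_out = static_cast<uInt>(buf.size());
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            inflateEnd(&zs);
            throw std::runtime_error("inflate failed");
        }
        out.insert(out.end(), buf.data(), buf.data() + (buf.size() - zs.avail_out));
    } while (ret != Z_STREAM_END);

    inflateEnd(&zs);
    return out;
}
```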
@lippserd @N-o-X @htriem Please comment.
I like the idea of using gzip here, but I'm not sure if this issue currently has high priority. I think many network related issues come from things like endless config sync and issues with the replay log. Correct me if I'm wrong.
Sure. But as the author said, on ships every KB is a "network related issue".
I don't think that we'll switch to another protocol in the near future. So this feature absolutely makes sense to me 👍
@pdolinic @julianbrost Any security objections regarding SSL compression (instead of #7994)?
Compression was dropped in TLS 1.3. Also, implementing this is somewhat dangerous as this can leak information.
... e.g. ...?
For example https://en.wikipedia.org/wiki/CRIME and https://en.wikipedia.org/wiki/BREACH. The underlying issue will probably exist here as well.
Yup, Julian said it all. Same thing for VPNs: nobody uses compression there anymore, because of the increased predictability it introduces. Better rely on the latest ciphers included in the TLS 1.3 standard (if you don't need backwards compatibility).
First of all: I like the proposal made by @Al2Klimov. Not sure about the tab-magic, though. While it would work, I'd prefer to see a full capability exchange at connection setup. Optional of course, for compatibility reasons. This might also help to address other issues; please contact me for suggestions of what else to exchange at handshake time.
As again and again the idea of replacing the cluster protocol is popping up, please let me add some more related thoughts:
- Before taking any decision, PLEASE measure traffic in a real-world setup. Measure at runtime, and measure when deploying new configuration. Measure between masters, masters and satellites, agents and their parent instances.
- Simulate one of the scenarios the issue author explained: have a large configuration, disconnect your satellite every 2-3 minutes for a couple of seconds. Watch what we're sending at each reconnect. How would you address what you're seeing?
- Pick a scenario with many agents, simulate a setup deploying a configuration multiple times an hour. What happens, when there are changes in global zones? What happens, when there are changes in the top zone only? Name the unnecessary parts in our payload. How can this be improved?
- Take your conclusions and try to address the underlying problems. If unsure: please ask. There have been very concrete proposals of how to address this years ago
- Try to contradict this statement: NetString and JsonRPC are both protocols with very low overhead when it comes to sending text-based content
- Compression would be an improvement. As our payload is mostly UTF8 text, I'm absolutely in favor of using gzip compression for our payload. But this comes AFTER fixing all the unnecessary (and huge) noise
- Try to answer for yourself: how should any protocol reduce traffic in a notable way, when completely skipping compression? If the payload of our packets wouldn't change, how could there be notably less traffic?
- Consider the amount of work replacing the cluster protocol would involve. Try to name all the side effects it has: involved components, compatibility, documentation, upgrade paths. Replacing the transport in the source code is trivial, but that's only a very small part of what to expect here. Compare the result to what we would gain from doing so.
As you can read between the lines: I'm strictly against replacing the protocol. You risk wasting a lot of time by addressing the wrong issues. Instead, please try to improve what we have. I'm absolutely in favor of fixing weak points in our cluster communication. Where are those weak points, and how can this get better? Here are some initial tasks:
- do not send configuration the other node already has
- do not send configuration the other node doesn't want
- check whether there are pointless messages in our replay log
- add compression
- do not fear the attack vectors mentioned above, try to understand them
- optionally try to mitigate similar attacks, even if they do not (yet) exist
Cheers, Thomas
- do not fear the attack vectors mentioned above, try to understand them
I do, and thus I think that if compression were implemented, it would have to be an opt-in feature with a warning that it poses a risk to confidentiality. And given that it would need such a warning, I have doubts that it's a good idea to implement at all.
Please do not let those attack vectors confuse you. Payload compression is fine, TLS compression is not - I think we all agree on this. Why should compressed data be evil per design? Can we no longer mail compressed attachments? Try to understand how those attacks work, and you'll feel much better.
To address above security concerns:
@julianbrost is right about TLS compression, I wouldn't use it for the time being. Guess no one would.
...nobody uses compression for VPNs anymore
Sources? IMHO this is not true. On the contrary, I'd argue that it is hard to find uncompressed VPN connections in the real world. By "real world" I mean enterprise setups, site-to-site VPNs and so on. Attacks like VORACLE work in the lab, and for very, very specific scenarios. You can attack the OpenVPN connection you're using for your personal web browser. You'll fail for most compressed real-world VPNs.
Also, Icinga... is not a web browser. If in doubt, please pick CRIME, BREACH or VORACLE, study how they work and try to sketch an attack on Icinga. You'll not get very far, I guess, but I'm always eager to learn. So please, do not hesitate to prove me wrong on this.
Being paranoid is a good thing. If you want to introduce an additional generic protection layer against known-plaintext attacks in general: start with space-padding our requests. For even more randomness: add an additional nonce-like field to the JsonRPC payload at the document root level. Add randomness to our request IDs. Once you have done so, my gut feeling tells me that adding gzip compression to the payload would not hurt. Blind guess: it would even help improve randomness, as more bits are being involved.
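For illustration, a rough sketch of what such padding and a nonce-like field could look like, assuming a hypothetical helper (AddNoiseFields) that post-processes an already-serialized JSON-RPC body; the "nonce" and "pad" members are invented here and are not part of icinga2's actual cluster messages.

```cpp
// Hypothetical sketch: add a random nonce and random-length space padding to an
// outgoing JSON-RPC body before it is encrypted (and possibly compressed).
#include <random>
#include <string>

std::string AddNoiseFields(const std::string& jsonBody)
{
    static thread_local std::mt19937_64 rng{std::random_device{}()};

    // Random 64-bit nonce plus 0..255 spaces of filler.
    std::uniform_int_distribution<int> padLen(0, 255);
    const std::string pad(static_cast<size_t>(padLen(rng)), ' ');
    const std::string nonce = std::to_string(rng());

    // Assumes jsonBody is a JSON object with at least one member, e.g.
    // {"jsonrpc":"2.0",...}; splice the extra members in before the closing brace.
    std::string out = jsonBody;
    out.insert(out.size() - 1, ",\"nonce\":\"" + nonce + "\",\"pad\":\"" + pad + "\"");
    return out;
}
```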
All of this is not required; fixing TLS is not our job. But adding an extra safety barrier against upcoming attack vectors wouldn't hurt. Also, re-evaluate our defaults from time to time. Choosing cipher suites, enforcing PFS and so on should of course remain possible for the experienced user, but it's our job to ship good and secure defaults. Of course, always within the range of what the requirement of not breaking compatibility with older deployed agents allows.
Last but not least, there are again and again issues asking us to allow using an external CA. We should add a big fat note to our documentation stating that this is considered very bad practice and insecure by design. The companies asking for this are 90% banks. Their requirement is NOT more security, at least not how you and I think about security. They are usually deploying their CA for the sole purpose of being able to decrypt ALL their encrypted connections for... reasons. It might feel stupid to you, but that's what their security teams are practicing. Sniffing every encrypted connection of all employees is what they call DLP (Data Loss Prevention).
And please do not forget about the original issue: excessive traffic for a very specific scenario. A ship losing its connection again and again, being hammered with useless data on each reconnect. Fixing this would be wonderful, as we're generating a lot of useless traffic in the datacenter too.
Forget my proposal. It's a dead end, as custom ASIO streams are a PITA.
Payload compression is fine, TLS compression is not - I think we all agree on this.
No we don't.
and try to sketch an attack for Icinga. You'll not get very far I guess, but I'm always eager to learn. So please, do not hesitate to prove me wrong on this.
The core issue is that you should not compress confidential data together with attacker-controlled data because otherwise, an attacker can change its part of the payload to guess parts of the confidential data. Once the size of the compressed data decreases, they know they guessed a part because their guess is now redundant information and allows a higher compression rate. Incrementally, they can guess larger parts of the confidential data.
How could this affect Icinga? Look at a check result. It contains the command line (which might contain sensitive information) and the output (which, depending on your setup, might be attacker-controlled).
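A toy demonstration of this length leak, assuming zlib's compress2 as the compressor; here "secret" stands in for the sensitive command line and "guess" for the attacker-controlled plugin output, and the whole thing is an illustration of the CRIME/BREACH mechanism rather than an attack on icinga2 itself.

```cpp
// Toy illustration of a compression length oracle. Build with e.g.: g++ oracle.cpp -lz
#include <zlib.h>
#include <cstdio>
#include <string>

// Compress a plaintext with zlib and return only the compressed size.
static uLong CompressedSize(const std::string& plain)
{
    uLong bound = compressBound(plain.size());
    std::string out(bound, '\0');
    uLongf outLen = bound;
    compress2(reinterpret_cast<Bytef*>(&out[0]), &outLen,
              reinterpret_cast<const Bytef*>(plain.data()), plain.size(),
              Z_BEST_COMPRESSION);
    return outLen;
}

int main()
{
    // "secret" plays the role of e.g. a password in a check command line.
    const std::string secret = "password=hunter2";

    // The attacker controls another part of the same message (e.g. plugin output)
    // and varies a guess; both end up in one compressed payload.
    const std::string guesses[] = { "password=aaaaaaa", "password=hunter2" };

    for (const std::string& guess : guesses) {
        uLong size = CompressedSize(secret + "\n" + guess);
        std::printf("guess \"%s\" -> %lu compressed bytes\n", guess.c_str(), size);
    }

    // The correct guess typically compresses smaller, because it is redundant with
    // the secret; observing the size difference confirms the guess without ever
    // seeing the plaintext.
    return 0;
}
```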
So you opt against #7994?
So you opt against #7994?
Haven't looked at it in detail. If it enables compression by default for each connection, yes. If it's an opt-in feature, one could argue that there might be scenarios where this isn't an issue for you but the need for reduced data transfer is real.
I think the better (but also more time-intensive) approach would be to avoid sending redundant data in the first place instead of strapping on a layer of compression to reduce the impact of that redundant data. This would also benefit everyone, not just those who enable this feature.
If it enables it by default for each connection, yes.
OK, feel free to close it. (Security is a more honorable reason than difficulty. ;-) )
Wait! What if we compress message-by-message?
As mentioned in https://github.com/Icinga/icinga2/issues/3387#issuecomment-982472433, you can get into a problematic situation with just a single check result. Yes, resetting the state of your compression algorithm helps, but getting this right is probably hard.
Oh man! Why does all this have to be so hard?

As we're on the topic: regarding the ships mentioned by the OP: would you agree that special situations require special monitoring tools?
Without being a dev, and without knowing all the technical complexity of Icinga 2 (yet): before compression and optimizing transmission, wouldn't it make sense to only send back the bare minimum, i.e. 1) the check exit values (0, 1, 2, 3) plus 2) a mapped host ID? That would shrink the delivery. Another crazy idea: if nothing changes on the agent, it wouldn't have to send anything at all (but you would need some kind of external heartbeat just to make sure it is up, which could be any other system of choice). How all of this works technically, no idea :D Just trying to provide some brainstorming.
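For illustration only, a hypothetical bare-minimum wire record along the lines of this suggestion; the field layout is made up, and icinga2's actual cluster messages are JSON-RPC and carry far more information (output, performance data, timestamps, metadata).

```cpp
// Hypothetical bare-minimum check result record, roughly as suggested above.
// Not an actual icinga2 message format.
#include <cstdint>

#pragma pack(push, 1)
struct MinimalCheckResult {
    uint32_t hostId;   // host (or service) mapped to a numeric ID known on both sides
    uint8_t  state;    // plugin exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
};
#pragma pack(pop)

// 5 bytes per result instead of a full JSON-RPC check result message --
// at the cost of losing plugin output, performance data and metadata.
static_assert(sizeof(MinimalCheckResult) == 5, "expected a packed 5-byte record");
```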