WAN friendliness
I wanted to file a ticket for this in order to at least track the thinking. As you know, I'm having real trouble making graylog2 work well in a distributed environment (i.e. across two datacenters).
At the moment, if you add a server node over any WAN-type latency, the application becomes unusable: API calls time out when pinging the nodes, and the whole thing slows to a crawl.
When adding a radio node we see the same issues. I realise that the app needs to make API calls to all the server and radio nodes in order to do health checks, but the whole thing becomes unusable once latency is a factor, and I think this will prevent people with multi-datacenter setups from adopting Graylog.
At the moment we're having to consider running multiple clusters across datacenters, which kind of defeats the purpose of centralized logging :)
I had the same problem and recompiled the code with a higher timeout value. The default of 5 seconds is far too short; I was getting constant timeouts.
Is there a reason why this isn't configurable in the graylog2 web interface config file?
How did you do this, rgardam? What did you set the timeouts to?
Actually, according to this doc (https://groups.google.com/forum/#!topic/graylog2/MJ8UmZeKIxk) you can increase it via the config, but from looking at the source I thought you couldn't.
What I did was change `private static final Integer DEFAULT_TIMEOUT = 5;` to `20`
in https://github.com/Graylog2/graylog2-web-interface/blob/0.20/app/lib/Configuration.java
If you run the web interface in the foreground (rather than sending it to the background), you can see what it's complaining about.
Obviously, if your Elasticsearch cluster is overloaded, that should also be addressed.
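For reference, the workaround described above boils down to editing a single constant and recompiling. A minimal sketch, trimmed to just the relevant constant (the real `Configuration.java` contains much more):

```java
// Sketch of the workaround in graylog2-web-interface 0.20's
// app/lib/Configuration.java: the API timeout is a hard-coded constant,
// so raising it requires editing the source and recompiling.
public class Configuration {
    // was: private static final Integer DEFAULT_TIMEOUT = 5;
    private static final Integer DEFAULT_TIMEOUT = 20; // seconds; 20 worked here over WAN latency

    public static Integer apiTimeoutSeconds() {
        return DEFAULT_TIMEOUT;
    }

    public static void main(String[] args) {
        System.out.println("API timeout: " + apiTimeoutSeconds() + "s");
    }
}
```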
This really needs work regarding the cluster organization. Perhaps allowing users to specify preferences or weights for servers could make most of this problem go away. It might also be worthwhile to investigate whether the current master can relay the commands the web interface needs to direct at individual servers.
I haven't checked the latest radio node behaviour; is it still poor over the WAN?
The basic behaviour has not changed, no.
Hi,
Would using something like the ES Tribe node to aggregate queries across multiple ES clusters help alleviate this issue by providing a centralized view? One would still need to set up a graylog-server / ES cluster in each datacenter to house the data (which is fine with me, and suits my data segregation needs nicely).
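For anyone unfamiliar with the tribe node idea: the setup being suggested would look roughly like this in the tribe node's `elasticsearch.yml`. This is a sketch only; the cluster names are placeholders for your per-datacenter ES clusters, and you'd need to verify that the ES version in your Graylog stack supports tribe nodes.

```yaml
# elasticsearch.yml on a dedicated tribe node (sketch; cluster names
# are placeholders). The tribe node joins both remote clusters and
# presents a merged read-only view for queries.
tribe:
  dc1:
    cluster.name: graylog-dc1
  dc2:
    cluster.name: graylog-dc2
```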
-n
It would solve part of the problem; however, Graylog nodes do not tolerate high latency between them, which is the root cause of this issue.
Eventually we will address this, but honestly it is not high on the list right now.