WAN friendliness
I wanted to file a ticket for this in order to at least track the thinking. As you know, I'm having real trouble making graylog2 work well in a distributed environment (i.e. across two datacenters).
At the moment, if you add a server node over any WAN-type latency, the application becomes unusable: API calls time out when pinging the nodes, and the whole thing slows to a crawl.
When adding a radio node we see the same issues. I realise that the app needs to make API calls to all the server and radio nodes in order to do health checks, but the whole thing becomes unusable once latency is a factor, and I think this will prevent people with multi-datacenter setups from adopting Graylog.
At the moment we're having to consider running multiple clusters across datacenters, which kind of defeats the purpose of centralized logging :)
I had the same problem and recompiled the code with a higher timeout value. The default of 5 seconds is far too short; I was getting constant timeouts.
Is there a reason why this isn't configurable in the graylog2 web interface config file?
How did you do this, rgardam? What did you set the timeouts to?
Actually, according to this doc (https://groups.google.com/forum/#!topic/graylog2/MJ8UmZeKIxk) you can increase it via the config, but from looking at the source I thought you couldn't.
What I did was change `private static final Integer DEFAULT_TIMEOUT = 5;` to `20`
in https://github.com/Graylog2/graylog2-web-interface/blob/0.20/app/lib/Configuration.java
If you run the web interface in the foreground (rather than sending it to the background), you can see what it's complaining about.
Obviously, if your Elasticsearch cluster is overloaded, that should also be addressed.
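For reference, the workaround described above boils down to editing a single constant and recompiling. A minimal sketch, trimmed to just the relevant constant (the real `Configuration.java` contains much more):

```java
// Sketch of the workaround in graylog2-web-interface 0.20's
// app/lib/Configuration.java: the API timeout is a hard-coded constant,
// so raising it requires editing the source and recompiling.
public class Configuration {
    // was: private static final Integer DEFAULT_TIMEOUT = 5;
    private static final Integer DEFAULT_TIMEOUT = 20; // seconds; 20 worked here over WAN latency

    public static Integer apiTimeoutSeconds() {
        return DEFAULT_TIMEOUT;
    }

    public static void main(String[] args) {
        System.out.println("API timeout: " + apiTimeoutSeconds() + "s");
    }
}
```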
This really needs work regarding the cluster organization. Perhaps allowing users to specify preferences or weights for servers could make most of this problem go away. It might also be worthwhile to investigate whether the current master can relay the commands the web interface needs to direct at individual servers.
I haven't checked the latest radio node behaviour; is it still poor over the WAN?
The basic behaviour has not changed, no.
Hi,
Would using something like the ES Tribe node to aggregate queries across multiple ES clusters help alleviate this issue by providing a centralized view? One would still need to set up a graylog-server / ES cluster in each datacenter to house the data (which is fine with me, and suits my data segregation needs nicely).
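For anyone unfamiliar with the tribe node idea: the setup being suggested would look roughly like this in the tribe node's `elasticsearch.yml`. This is a sketch only; the cluster names are placeholders for your per-datacenter ES clusters, and you'd need to verify that the ES version in your Graylog stack supports tribe nodes.

```yaml
# elasticsearch.yml on a dedicated tribe node (sketch; cluster names
# are placeholders). The tribe node joins both remote clusters and
# presents a merged read-only view for queries.
tribe:
  dc1:
    cluster.name: graylog-dc1
  dc2:
    cluster.name: graylog-dc2
```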
-n
It would solve part of the problem; however, Graylog nodes do not tolerate high latency between them, which is the root cause of this issue.
Eventually we will address this, but honestly it is not high on the list right now.