
Unresponsive processes not restarted

Open lopuhin opened this issue 9 years ago • 21 comments

Can the processes that become unresponsive and are eating 100% CPU be restarted? They are marked as DOWN by haproxy: [screenshot of the haproxy stats page, 2016-05-13]

As far as I understand, haproxy just handles the requests and redirects them to alive workers, and it is someone else's job to restart them. But docker cannot restart them because it does not know that they are down, right? I know that, for example, uwsgi has such a feature, but perhaps docker and haproxy can be made to behave in a similar way?

lopuhin avatar May 13 '16 09:05 lopuhin

Thinking about it more, perhaps it would be better to fix it on the splash side - this way it will benefit anyone running a single splash instance.

lopuhin avatar May 13 '16 09:05 lopuhin

Yeah. Even worse: even if you kill and restart them manually, they are not picked up by haproxy, because the haproxy docker container doesn't detect new IP addresses of linked containers; I think this is the first issue to solve. It should be possible to fix since haproxy 1.6 (http://blog.haproxy.com/2015/10/14/whats-new-in-haproxy-1-6/), which provides a way to resolve hostnames at runtime. I haven't got around to it yet; help is appreciated :)
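The runtime DNS feature might look roughly like this in haproxy.cfg — a hedged sketch, not Aquarium's actual config; the resolver name, backend name, hostnames, and ports are assumptions, and `init-addr none` needs haproxy 1.7+:

```
# Let haproxy re-resolve container hostnames at runtime via docker's
# embedded DNS server (127.0.0.11 on user-defined networks).
resolvers docker
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    hold valid 10s

backend splash
    balance leastconn
    # re-resolve splash0/splash1 at runtime instead of caching the IP
    # looked up at startup; init-addr none (haproxy 1.7+) lets a server
    # start even while its hostname does not resolve yet
    server splash0 splash0:8050 check resolvers docker init-addr none
    server splash1 splash1:8050 check resolvers docker init-addr none
```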

kmike avatar May 13 '16 09:05 kmike

Are you thinking about a builtin watchdog process?

kmike avatar May 13 '16 09:05 kmike

More like a feature of a splash instance that would check if rendering is taking too long, and would then try to kill the call to the browser engine and re-initialize it. I'm not sure if this is possible.

lopuhin avatar May 13 '16 09:05 lopuhin

There is a 'timeout' feature, but it works only if rendering yields to the event loop, which is not the case when we hit some Qt bug.
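To illustrate why a timeout can't fire when rendering never yields, here is a small sketch using asyncio rather than the Qt/Twisted loop Splash actually uses (the function names are made up for the demo):

```python
import asyncio
import time

async def misbehaving_render():
    # Simulates a call that blocks inside the browser engine and never
    # returns control to the event loop (no await point).
    time.sleep(0.5)
    return "rendered"

async def render_with_timeout():
    start = time.monotonic()
    try:
        result = await asyncio.wait_for(misbehaving_render(), timeout=0.1)
    except asyncio.TimeoutError:
        result = "timed out"
    return result, time.monotonic() - start

if __name__ == "__main__":
    result, elapsed = asyncio.run(render_with_timeout())
    # The 0.1s timeout cannot interrupt the blocking sleep: the loop only
    # regains control after the full 0.5s has already passed.
    print(result, round(elapsed, 2))
```

The timeout machinery lives in the event loop, so a synchronous call that blocks the loop's thread also blocks the timeout from firing.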

kmike avatar May 13 '16 09:05 kmike

Yeah, maybe we could launch some thread or use signals here. If it is possible to kill and restart the browser engine, it should be pretty robust.

lopuhin avatar May 13 '16 09:05 lopuhin

It could be tricky because we're running several QWebViews in the same event loop, so all concurrent renders happen in a single thread; there may be some tasks launched by Qt that execute in different threads, but AFAIK by default most things we control are single-threaded. There could be a workaround (it's worth checking), but I'm a bit pessimistic. Aquarium is essentially Splash with multi-processing.

kmike avatar May 13 '16 09:05 kmike

I see, thanks! I'll try to check if there is a way :)

lopuhin avatar May 13 '16 09:05 lopuhin

Another option that seems easier, and also seems to solve the problem for Aquarium (with the new haproxy feature that you mentioned), is to exit if processing takes too long, similar to Splash's --max-rss option.
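A minimal sketch of that idea — an assumption, not Splash's actual code: the serving loop "pets" a watchdog after each completed render, and a daemon thread force-exits the process if no render completes within a stall limit, leaving the restart to an external supervisor:

```python
import os
import threading
import time

MAX_STALL = 60  # seconds without progress before giving up (made-up value)

class Watchdog:
    def __init__(self, max_stall=MAX_STALL, interval=1.0, _exit=os._exit):
        self._last_beat = time.monotonic()
        self._max_stall = max_stall
        self._interval = interval
        self._exit = _exit  # injectable so the class can be tested
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def beat(self):
        """Call after each successfully completed render."""
        self._last_beat = time.monotonic()

    def stalled(self):
        return time.monotonic() - self._last_beat > self._max_stall

    def _run(self):
        while True:
            time.sleep(self._interval)
            if self.stalled():
                # os._exit skips normal cleanup on purpose: a graceful
                # shutdown may itself hang in the stuck rendering call.
                self._exit(1)
```

The watchdog runs in a separate thread precisely because, as discussed above, the rendering thread may be blocked and unable to check anything itself.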

lopuhin avatar May 13 '16 09:05 lopuhin

So, a watchdog process? Also, it seems these checks can be implemented in haproxy itself (it can run external scripts), but I haven't checked the details.

kmike avatar May 13 '16 10:05 kmike

What I meant was more like a watchdog thread inside the splash server, but now I understand that this is not that good: it would be complicated to do anything except kill the process in this case, and it does not bring much benefit to standalone splash, only complicates it. So it seems more reasonable to do it in Aquarium. This external-check haproxy feature could perhaps be used to kill a server that does not respond to a ping.
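An external-check setup could look roughly like this — a hedged, untested sketch; the script path is an assumption, and note the check can only mark a server DOWN, so something else still has to restart the container:

```
#!/bin/sh
# Enable in haproxy.cfg with, roughly:
#   global
#       external-check
#   backend splash
#       option external-check
#       external-check command /usr/local/bin/check_splash.sh
#
# haproxy invokes the script with the arguments:
#   <proxy_address> <proxy_port> <server_address> <server_port>
HOST="$3"
PORT="$4"
# Splash answers /_ping with {"status": "ok", ...} while it is healthy;
# a non-zero exit marks the server as DOWN.
curl -fsS --max-time 5 "http://${HOST}:${PORT}/_ping" | grep -q '"ok"'
```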

lopuhin avatar May 13 '16 11:05 lopuhin

Any update on this?

Glennvd avatar Jan 05 '17 16:01 Glennvd

@Glennvd no updates, sorry.

kmike avatar Jan 05 '17 17:01 kmike

@Glennvd I've been having the same issue.

In my case, I discovered that Xvfb is the process failing and thus not letting splash do its job.

If you log into one of the DOWN containers with

docker-compose exec splash6 bash

and then

ps ax

you should see something like

PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:18 python3 /app/bin/splash --proxy-profiles-path ...
    8 ?        Z      0:00 [Xvfb] <defunct>
   48 ?        Ss     0:00 bash
   67 ?        R+     0:00 ps ax

Now, by simply restarting the Xvfb process with Xvfb :100 -screen 0 1024x768x24, it will recover.

The question is, how do we manage to make Xvfb recover automatically? I'm going to file an issue on the Splash repository as this isn't a bug in aquarium.

ale316 avatar Jan 10 '17 09:01 ale316

Sorry, this actually seems to be a side effect of the machines crashing. If I log into one of the crashed containers immediately after it dies, Xvfb seems to still be up and running. After a while it crashes, and I guess restarting it triggers a Splash reload.

ale316 avatar Jan 10 '17 10:01 ale316

@lopuhin, what did you end up doing about this issue? I'm thinking of just setting a cron job to restart the docker containers a few times a day.

landoncope avatar Feb 26 '18 02:02 landoncope

@landoncope I did the same... a crontab that restarts every hour (yes, it had to be that frequent).
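For reference, the blunt crontab workaround might look like this (the path and schedule are assumptions; adjust both to taste):

```
# restart the whole Aquarium stack every hour
0 * * * * cd /home/user/aquarium && docker-compose restart
```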

Hopefully there's a better fix.

cristianocca avatar Mar 21 '18 01:03 cristianocca

I am also having the same issue. I automated fetching the page source for 1.5 lakh (150,000) URLs with a 5-instance Splash Aquarium; it takes almost 15 hours to complete. After it finished, a few instances were down and were not restarted.

Is restarting Aquarium the only way to solve this issue? Is there any other solution?

Mideen avatar Nov 13 '18 06:11 Mideen

I'm not aware of any new solutions, but with splash 3.2 reliability is much better in our experience.

lopuhin avatar Nov 13 '18 06:11 lopuhin

Is there any API call to check externally whether the splash instances are down, instead of viewing the HAProxy statistics? If so, we could restart Aquarium only when an instance is down, with no need to restart periodically.
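Splash does expose a /_ping endpoint that returns {"status": "ok"} while an instance is healthy. A minimal external poller could look like this sketch (the instance URLs and ports are assumptions; what to do with the DOWN list is left to the caller):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def is_alive(raw_body: bytes) -> bool:
    """A /_ping response counts as alive only if status == "ok"."""
    try:
        return json.loads(raw_body.decode())["status"] == "ok"
    except (ValueError, KeyError, TypeError):
        return False

def check_instances(urls, timeout=5):
    """Return the subset of URLs that failed the /_ping check."""
    down = []
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as resp:
                if not is_alive(resp.read()):
                    down.append(url)
        except (URLError, OSError):
            down.append(url)
    return down

if __name__ == "__main__":
    # one URL per splash container; these ports are made up for the example
    urls = ["http://localhost:8051/_ping", "http://localhost:8052/_ping"]
    for url in check_instances(urls):
        print("DOWN:", url)
```

Such a script could run from cron and restart the stack only when `check_instances` reports something down, instead of restarting unconditionally.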

Mideen avatar Nov 13 '18 06:11 Mideen

What's the way to restart this, say, once every 4 hours?

davidkong0987 avatar Aug 24 '22 16:08 davidkong0987