learn-ocaml Feature: new option à la `--replace` to leverage this existing feature in a `docker restart` context

trafficstars

Related user(s):

@erikmd, @AltGr

Related issue(s) or PR(s):

#567

Related project scope(s):

No response

The problem:

learn-ocaml 1.0.0 's learn-ocaml build serve --replace is very nice :100:
but in a docker-compose context, on an exodir update, we don't successively start 2 learn-ocaml procs
(anyway, there's the 1-proc-per-container rule!)
we rather use docker compose restart, and we would really like to benefit from the --replace semantics in this case

Wanted solution:

add an alternative option to --replace, e.g. --fork-replace, or --smooth-replace

Considered alternatives:

(add an additional option, passed along with --replace, e.g. --fork or --quick-restart)
(but issue if --fork is passed without --replace)

Additional context:

We should think about ENV VARS as well! (e.g., a single environment variable that can be set to {}="0", "1" or "2")

Feb 29 '24 23:02 erikmd

I am not sure the executable itself can do anything in the case when it is containerised in this way (I don't understand what --fork-replace would do ?): AFAIK replacing a containerised server with minimal downtime is one of the core functions of any load-balancer / orchestrator: start new container, wait for it to start accepting connections, start routing incoming connections to it, kill older container -- was'nt that available before I added the option for the stand-alone case ?

Maybe the new option would help with the management of stored state during such transitions ?

Mar 01 '24 12:03 AltGr

Hi @AltGr, thanks a lot for your comment! Indeed, the overall aim is to have a Dockerised server with minimal downtime.

I agree that this may be possible to implement this by using a Docker Swarm or Kubernetes balancer, several instances sharing the www folder, but this looks quite difficult to me to implement though. And in usual teaching context, we only have ~100 students, where a single instance using Docker Compose already scales for that, while the --replace option can't be used in this context IINM. Indeed:

in general we only have 1 main process per container,
and if we spin two different instances, the www folder is not shared between containers, nor the TCP interface (each container has a localhost:8080 address which is local).

So my idea was to support a similar CLI option, to offer this high-availability feature easily within the app's logic, namely:

at container startup, the default command learn-ocaml build serve --repo=/repository --fork-replace is run;
if the www folder already exist (and should be rebuilt), it forks its own process; the first one serves the existing www, and the second one behaves as learn-ocaml build serve --repo=/repository --replace;
if the www folder doesn't exist, the option --fork-replace is ignored and no fork occurs.

→ What do you think?

Have a nice WE!

Mar 01 '24 18:03 erikmd

Maybe the new option would help with the management of stored state during such transitions ?

I didn't think about this (it's another topic!): maybe there will also be sth to do to get a safer --replace; I don't know.

Mar 01 '24 18:03 erikmd

Ah! I understand better your intent. Wouldn't [ -d www ] && learn-ocaml serve & learn-ocaml build serve --replace work then ?

Mar 04 '24 09:03 AltGr

Hi @AltGr! Mostly yes, this could work. (Except that [ -d www ] && … would yield a bad exit code if www doesn't exist.)

Actually when Docker entrypoints come into play, I believe it's always slightly better to avoid spinning a shell, to ensure that the entrypoint binary always receives proper signals (see e.g. https://hynek.me/articles/docker-signals/)

So, what do we decide?

Option 1. Just keep learn-ocaml CLI as is, and change the Docker entrypoint. (I downvote this one because of signals.) Option 2. Implement --fork-replace with Unix.fork along with the current implementation of --replace. Option 3. Implement --fork-replace with Unix.fork and with Unix.kill (given we know the PID of the child process)

And if ever Option 2. or 3. looks good to you, would you prefer to implement it, or to review the PR ? :)

Mar 04 '24 13:03 erikmd

After reading https://hynek.me/articles/docker-signals/ I still think signals should work, since we already use dumb-init... it may be worth testing anyway. Otherwise no strong objection to option 3 (maybe with a more high-level name than fork-init, that doesn't convey the intent: something like --serve-right-away would be more explicit). (also, be very careful with forking when any lwt stuff is going on, it's very easy to have extremely weird stuff happening even if you tried to clean up all lwt threads in the child)

Mar 22 '24 13:03 AltGr

Thanks @AltGr !

After reading https://hynek.me/articles/docker-signals/ I still think signals should work, since we already use dumb-init...

AFAIU, it would be very OK if the shell is ephemeral (using the exec builtin), which is not the case.

it may be worth testing anyway.

Sure, I can volunteer to do more tests.

Otherwise no strong objection to option 3 (maybe with a more high-level name than fork-init, that doesn't convey the intent: something like --serve-right-away would be more explicit).

OK, good point.

(also, be very careful with forking when any lwt stuff is going on, it's very easy to have extremely weird stuff happening even if you tried to clean up all lwt threads in the child)

OK. Would you have a few suggestions in mind, regarding the things to monitor ? maybe, with a dedicated tool ?

And do you think that the potentiel weird things that can occur would be specific to the forking ? or would they already occur with the current implementation of --replace ?

Have a very nice WE!

Mar 22 '24 18:03 erikmd

Ah indeed, the Dockerfile ENTRYPOINT is ["dumb-init","/usr/bin/learn-ocaml","--sync-dir=/sync","--repo=/repository"], which doesn't go through a shell, but we'd need to interspede the snippet between the dumb-init and the learn-ocaml process. Of course we could exec the second run of learn-ocaml but we'll still run into problems if one tries to interrupt while the "temporary server" is still active (and that one will have to be in a subshell anyway, even if we exec it).

I'm still pretty sure it can be done... but the balance of which approach is simplest might be changed :)

As long as you fork before running Lwt_main.run you'll be fine. After that, you can use Lwt_unix.fork but if there is any lwt calls within the child, extremely bad and confusing things will happen. It's just fork, replace just kills an external process, which is safe.

Mar 25 '24 10:03 AltGr

learn-ocaml learn-ocaml copied to clipboard

Feature: new option à la `--replace` to leverage this existing feature in a `docker restart` context

Related user(s):

Related issue(s) or PR(s):

Related project scope(s):

The problem:

Wanted solution:

Considered alternatives:

Additional context:

learn-ocaml
learn-ocaml copied to clipboard