basic architecture doc
Thanks for opening this up and the nice docs so far. I am wondering if a basic architecture doc focused on infrastructure and runtime can be produced. Things that come to mind:
- how does client job submission work?
- what persistent server processes are there, and what information is persisted?
- what agents run on EC2 instances
- what are the requirements for the instance (a particular AMI, certain software), is there a way to SSH to them? Can new software (e.g. monitoring containers) be layered on top?
- how do the docker jobs get submitted?
- how are the job results relayed on completion?
- how does the runtime handle downloading S3 files in detail (does it wait to receive the full file)?
- do instances maintain caches of files from previous runs or do they get wiped?
Hi @gregwebs, thanks for your comment. We have such a document internally, and I plan on formatting it for external use (and also making sure it's up to date with reality).
Also, to address some of your specific questions:
> how does client job submission work?

Reflow does not do asynchronous job submission. Think of Reflow like you would a programming language interpreter (e.g., Python): you run `reflow run foo.rf` to run your program. It just happens to be able to use a large EC2 cluster to do its work (see the sketch below). Asynchronous submission is an orthogonal concern; that being said, we are planning on providing such a feature natively in Reflow.

> what persistent server processes are there, and what information is persisted?

For each instance that Reflow starts, there is currently only one persistent process: the reflowlet (Reflow's "agent" process), which is responsible for managing the machine's resources and coordinating requests from multiple Reflow evaluators. The only information that is persisted is cache metadata and data, if configured.

> what agents run on EC2 instances?

See above.
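To make the synchronous model concrete, here is a minimal workflow in the style of the README's hello world (file name hypothetical):

```
// hello.rf: a one-exec workflow.
val Main = exec(image := "ubuntu", mem := GiB) (out file) {"
	echo hello world >> {{out}}
"}
```

Running `reflow run hello.rf` blocks until evaluation completes (provisioning EC2 capacity along the way as needed) and then prints the value `Main` evaluates to.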
> what are the requirements for the instance (a particular AMI, certain software), is there a way to SSH to them? Can new software (e.g. monitoring containers) be layered on top?

The requirements are that Docker is available, that the AMI uses systemd, and that it can be configured via cloud-config. In its default configuration, Reflow uses the standard CoreOS system images. You can ssh into them (sketched below): when you run `reflow setup-ec2`, it captures your public ssh key in the configuration that is produced. This is spelled out in the README: "You can ssh into the host in order to inspect what's going on. Reflow launched the instance with your public SSH key (as long as it was setup by reflow setup-ec2, and $HOME/.ssh/id_rsa.pub existed at that time)."

> how do the docker jobs get submitted?

The reflowlets communicate directly with the docker daemons running on the instances.
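To make the ssh workflow above concrete (the host address is a placeholder; `core` is the standard CoreOS login user):

```
# One-time setup: generates Reflow's configuration, capturing
# $HOME/.ssh/id_rsa.pub so that launched instances accept your key.
$ reflow setup-ec2

# Later, inspect a running worker directly
# (address as reported in the EC2 console).
$ ssh core@ec2-203-0-113-10.compute-1.amazonaws.com
```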
> how are the job results relayed on completion?

When a Reflow program completes, the value that `Main` evaluates to is printed. However, this is seldom useful by itself. Usually the top-level `Main` is some side-effecting computation, typically copying the results to some desired destination. See 1000align for an example. We have some (I think) exciting plans to address this, too, by allowing Reflow modules to expose virtual file trees directly, and making file operations available via the `reflow` command.
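For illustration, a side-effecting `Main` might look like the following sketch, modeled on 1000align and assuming the `$/files` system module's `Copy` function (all other names are hypothetical):

```
param (
	// dest is the destination URL, e.g. "s3://my-bucket/hello.txt".
	dest string
)

val files = make("$/files")

// A stand-in for the real computation: any expression producing a file.
val hello = exec(image := "ubuntu", mem := GiB) (greeting file) {"
	echo hello world >> {{greeting}}
"}

// The program's result is the copy's side effect, not a printed value.
val Main = files.Copy(hello, dest)
```

As in the 1000align walkthrough, the parameter is supplied as a flag, e.g. `reflow run copy.rf -dest s3://my-bucket/hello.txt`.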
> how does the runtime handle downloading S3 files in detail (does it wait to receive the full file)?

Yes, Reflow currently runs an exec only when its prerequisites are fully evaluated. The current contract is that a `file` is a file, and a `dir` is a directory. We're working on introducing a form of streaming into Reflow so that an `exec` may receive (and produce) data incrementally. This would mean that files are materialized into Unix streams instead.
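For example (assuming Reflow's `file` literal syntax for S3 URLs; the bucket and key are hypothetical), the input below is materialized in full before the exec starts, so the command sees an ordinary local file rather than a stream:

```
// The S3 object is downloaded completely before the exec runs;
// {{input}} is then a regular file inside the container.
val input = file("s3://my-bucket/path/to/input.txt")

val Main = exec(image := "ubuntu", mem := GiB) (out file) {"
	wc -l {{input}} > {{out}}
"}
```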
> do instances maintain caches of files from previous runs or do they get wiped?

They do, for some time. File repositories are associated with specific allocs (resource allocations) on the individual machines, and their lifecycle is tied to the alloc's lifecycle. Allocs are collected only when they become idle (i.e., no evaluator is maintaining a lease on the alloc) and their resource reservations are needed by another evaluator. It's possible to explicitly reuse an alloc by passing the `-alloc` flag to Reflow's `run` command, but this is a little too much inside baseball for most users. We plan on making this behave "as expected" by a different mechanism: when an alloc is retired, instead of also removing its file repository, it is subsumed into an instance file repository. This becomes similar to a look-aside cache: when cache transfers are performed, we first look in the instance's repository before transferring files from the (remote) cache. The additional complication here is that we'll need to garbage-collect the instance repository whenever extra disk space is needed on the instance.
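For completeness, explicit alloc reuse would look something like the sketch below (the alloc identifier is a placeholder, and the flag placement assumes the usual `reflow run` flag syntax):

```
# Pin a run to an existing alloc so that its file repository is reused.
$ reflow run -alloc <alloc-id> foo.rf
```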
thanks for the detailed response! Everything makes sense here except for the first one about async submission. What happens if I tell reflow to run and then Ctrl-C?
Reflow behaves just like any other program interpreter: Ctrl-C stops execution. Of course, because Reflow is incremental, and intermediate reductions are cached, if you start it up again, it's (most of the time) able to pick up from where it stopped.
Also to clarify: when you hit Ctrl-C, Reflow will fail to maintain keepalives to the worker instances; this will cause them to automatically terminate after 10 minutes of idleness. This is also noted in the README:
> Reflow comes with a built-in cluster manager, which is responsible for elastically increasing or decreasing required compute resources. The AWS EC2 cluster manager keeps track of instance type availability and account limits, and uses these to launch the most appropriate set of instances for a given job. When instances become idle, they will terminate themselves if they are idle for more than 10 minutes; idle instances are reused when possible.
Does it catch Ctrl-C and send a stop request to the agent, or is reflow sending keepalives to the agent? Ctrl-C is maybe not the best example because it can be caught. What happens when I lose all network connectivity from where I am running reflow? If I close my laptop, does my job stop? So it works a lot better when submitted from a tmux session on an EC2 workstation?
Thanks for explaining. Idle instance termination is really a separate concern from this question since another job may be scheduled to the instance, so I don't think the README statement explains this aspect.
Reflow maintains a keepalive to all of its resource allocations. Thus if the job is killed in any way, or loses connectivity, the keepalive is no longer maintained. An instance is considered idle if none of its allocs are alive.
Thanks for all the explanations! I will close this, but you could re-open it if it serves as a useful reminder to add an architecture doc.
I am quite surprised by some aspects of the current operational architecture, although I see how it was probably the easiest way to get things to a good enough state. I use a more centralized job flow controller that I could try connecting this to in order to achieve an async workflow, etc.
The design may be unusual but it's not done out of ease or convenience. This design lets us very easily control the whole compute stack, and provision resources as they are needed—the cloud is elastic after all, and Reflow fully exploits this.
With respect to synchronous vs. asynchronous execution, I view this as an orthogonal concern: an external system can be responsible for managing individual Reflow jobs. (This is what we do at GRAIL.)
This split of responsibilities keeps things both orthogonal and simple: Reflow interprets and executes Reflow programs on a cloud provider; a separate system is responsible for managing a set of jobs.
Sorry, I may have misunderstood your description; going off an architecture doc would make things clearer :)
It also makes a lot more sense now that you do have a system for async execution. It would be great to see the architecture of that too, even if there is no implementation provided.
The code is beautiful. Just want to ask if you plan to publish an architecture doc soon. Thanks!
Hi @chenpengcheng, thanks! Yes, this is getting closer to the top of my list...
> Asynchronous submission is an orthogonal concern: that being said, we are planning on providing such a feature natively in Reflow.

> I view this as an orthogonal concern: an external system can be responsible for managing individual Reflow jobs. (This is what we do at GRAIL.)
@mariusae Is this still on the list and still supposed to go in Reflow core?
As always, very interesting project!
@hchauvin yep, @prasadgopal is working on this actively right now :-)
Awesome, thanks!