Make Raysect run across cluster nodes
Allowing Raysect to segment a render job and run parts of it across multiple computational nodes (i.e. a compute cluster) would require the following components.
- Redesign the observer base classes so that work can be defined as blocks of pixels rather than single pixels. These concepts should probably be relabelled as jobs containing tasks, where a task represents the processing of a single pixel. Larger jobs need to be sent to network nodes because of the trade-off between network latency and compute performance. The observer base classes also need to be modified to send the pickled scenegraph and other relevant information through to workers that may now be on remote machines; the existing RenderEngines would simply ignore this extra information.
- Create a new render engine, ClusterEngine(). This engine will take a list of worker nodes (IP addresses or host names) and perform the following actions (a sketch of the engine and its matching worker application follows this list):
  - Initialise the worker nodes by piping the scenegraph and a reference to the target observer across the network to each node.
  - Subdivide the full render into a set of jobs, each containing multiple tasks.
  - Set up a producer process that farms these jobs out on request to the cluster nodes.
  - Set up a consumer process that watches for incoming results from the worker nodes.
  - Integrate the results into the render.
  - Reset or shut down the workers according to the user's configuration.
- Build a Raysect worker node application. This is an executable that runs on each compute node and waits for commands from the main Raysect process. Given a scenegraph and an observer, it renders the pixels described in the jobs allocated to it.
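
The sketch below illustrates how these pieces could fit together. It leans on the standard library's multiprocessing.connection module to handle pickling and message framing, and loosely follows the existing render engine pattern of a run() method that receives a task list plus render and update callbacks. Only the name ClusterEngine comes from this proposal; the constructor arguments, port, authkey and message format are assumptions made purely for illustration and are not part of the current Raysect API.

```python
# Minimal sketch of the proposed ClusterEngine. All details (message format,
# port, authkey, run() signature) are illustrative assumptions.
from multiprocessing.connection import Client
from queue import Empty, Queue
from threading import Thread


class ClusterEngine:
    """Farms jobs (blocks of pixel tasks) out to remote worker nodes and
    integrates the returned results into the render."""

    def __init__(self, nodes, tasks_per_job=1024, port=9050, authkey=b"raysect"):
        self.nodes = nodes                  # worker host names or IP addresses
        self.tasks_per_job = tasks_per_job  # pixels grouped into a single job
        self.port = port
        self.authkey = authkey

    def run(self, tasks, render, update, render_args=(), update_args=()):
        # group single-pixel tasks into larger jobs to hide network latency
        jobs = [tasks[i:i + self.tasks_per_job]
                for i in range(0, len(tasks), self.tasks_per_job)]

        job_queue = Queue()
        for job in jobs:
            job_queue.put(job)
        result_queue = Queue()

        # one thread per worker node acts as producer (sends jobs) and
        # consumer (collects results) for that node
        threads = [
            Thread(target=self._drive_worker,
                   args=(node, job_queue, result_queue, render, render_args),
                   daemon=True)
            for node in self.nodes
        ]
        for thread in threads:
            thread.start()

        # integrate each task result into the render as jobs complete
        for _ in range(len(jobs)):
            for task_result in result_queue.get():
                update(task_result, *update_args)

        for thread in threads:
            thread.join()

    def _drive_worker(self, node, job_queue, result_queue, render, render_args):
        # initialise the worker by sending the pickled render callable, which
        # is assumed to carry the scenegraph/observer state the worker needs
        with Client((node, self.port), authkey=self.authkey) as connection:
            connection.send(("init", render, render_args))
            while True:
                try:
                    job = job_queue.get_nowait()
                except Empty:
                    break
                connection.send(("job", job))
                result_queue.put(connection.recv())  # one result per task
            connection.send(("shutdown", None))
```

The matching worker node application could be little more than a loop that accepts a connection from the engine, caches the render state sent at initialisation and then processes jobs until told to shut down. Again, this is a hypothetical sketch rather than an existing Raysect executable:

```python
# Minimal sketch of the worker node application; port, authkey and message
# format must match the (assumed) ClusterEngine protocol above.
from multiprocessing.connection import Listener

PORT = 9050
AUTHKEY = b"raysect"


def serve_forever():
    with Listener(("0.0.0.0", PORT), authkey=AUTHKEY) as listener:
        while True:
            with listener.accept() as connection:
                render, render_args = None, ()
                while True:
                    command, *payload = connection.recv()
                    if command == "init":
                        # unpickled callable carrying scenegraph/observer state
                        render, render_args = payload
                    elif command == "job":
                        tasks, = payload
                        # a job is a block of pixel tasks; return one result per task
                        connection.send([render(task, *render_args) for task in tasks])
                    elif command == "shutdown":
                        break


if __name__ == "__main__":
    serve_forever()
```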
The number of tasks assigned to each job is a user-configurable parameter. The user will need to decide the optimal split of work based on their cluster setup, tuning the value so that the worker nodes always have enough computational work to do while traffic propagates over the network. For example, if rendering a single pixel takes roughly 1 ms and a network round trip costs roughly 50 ms, jobs of a few thousand pixels keep the communication overhead to a small fraction of the compute time.
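
As a usage sketch, and assuming observers keep their existing pattern of accepting a render_engine, a cluster render might then look like the following; the node names and the tasks_per_job value are illustrative only and would need tuning for the cluster in question.

```python
# Hypothetical usage of the proposed engine; scene construction is omitted
# and all parameter values are illustrative only.
from raysect.optical.observer import PinholeCamera

camera = PinholeCamera((512, 512), parent=world)  # 'world' built elsewhere
camera.render_engine = ClusterEngine(
    nodes=["node01", "node02", "node03", "node04"],
    tasks_per_job=4096,  # large enough that per-job compute time dominates
)                        # the network round-trip latency
camera.observe()
```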