Implement MPIEngine render engine
This render engine uses SPMD (single program, multiple data) to perform renders in parallel, using MPI (message passing interface). The paradigm is that multiple independent processes are launched, each of which builds its own copy of the scenegraph and generates the full set of rendering tasks. Each process with rank > 0 then works on a subset of those tasks and sends its results to the rank 0 process, which consumes and processes them. After the render is complete, only process 0 holds the complete results.
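For reference, here is a minimal sketch of that pattern (this is not the MPIEngine source: `build_tasks`, `render_task` and the round-robin task split are stand-ins for the engine's scenegraph/task construction, render function and task assignment). Run with at least 2 processes, e.g. `mpirun -n 4 python sketch.py`:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()


def build_tasks():
    # Stand-in for scenegraph construction + task generation: each "task"
    # here is just an integer; in a real render it would be a pixel chunk.
    return list(range(100))


def render_task(task):
    # Stand-in for the (expensive) per-task render function.
    return task * task


# Every process builds its own copy of the task list in parallel (SPMD).
tasks = build_tasks()

if rank == 0:
    # Rank 0 only consumes results: one message per task, from any worker.
    results = {}
    for _ in range(len(tasks)):
        task_id, value = comm.recv(source=MPI.ANY_SOURCE)
        results[task_id] = value
else:
    # Each worker (rank > 0) takes a fixed, roughly equal subset of the tasks.
    for task_id in range(rank - 1, len(tasks), size - 1):
        comm.send((task_id, render_task(tasks[task_id])), dest=0)
```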
MPI communications are blocking (send/recv rather than isend/irecv) and use mpi4py's methods for general Python objects, rather than the high-performance equivalents utilising the buffer protocol (Send/Recv). The general-object methods are used to avoid any loss of generality in the render engine: it supports any function supported by the other render engines, not just ones returning objects that support the buffer protocol. Blocking comms simplify the implementation, and in testing (using the MPI variant of the raysect logo demo) the MPI comms were a small fraction (< 5%) of the total run time, with the vast majority (> 85%) spent in the render function itself. So the simplicity of implementation does not come with a significant performance cost.
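To illustrate the trade-off, the snippet below contrasts the two mpi4py APIs; the engine uses only the lowercase, pickle-based, blocking calls. The payloads here are invented for the example:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Lowercase send/recv: blocking, pickles any Python object. This is the
    # style the engine uses, so results are not restricted to buffer-like types.
    comm.send({"task": 3, "samples": [1.0, 2.5]}, dest=1, tag=0)

    # Uppercase Send/Recv: blocking, but requires a buffer-protocol object
    # (e.g. a NumPy array) -- faster, yet far less general.
    comm.Send(np.arange(4, dtype=np.float64), dest=1, tag=1)
elif rank == 1:
    obj = comm.recv(source=0, tag=0)    # receives an arbitrary pickled object
    buf = np.empty(4, dtype=np.float64)
    comm.Recv(buf, source=0, tag=1)     # fills a pre-allocated buffer in place
```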
Tasks are distributed approximately equally among all workers at startup, with no adaptive scheduling. Again, this is done for simplicity of implementation: an adaptive scheduler would need some way of tracking how long each process spends on its tasks, and all processes would need to communicate with one another about which tasks they were taking on. Adding an adaptive scheduler could be left to future work; in testing, any uneven runtime caused by the naive task distribution had only a small effect on the total run time, so the simplicity of the current approach is advantageous.
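Purely as an illustration of what "approximately equally" means here, the helper below shows one possible static split into contiguous chunks whose sizes differ by at most one task; the engine's actual assignment may differ in detail:

```python
def worker_slice(tasks, rank, size):
    """Return the fixed chunk of `tasks` assigned to worker `rank` (1..size-1)."""
    n_workers = size - 1                  # rank 0 only collects results
    # Spread the remainder over the first few workers so chunk sizes
    # differ by at most one.
    base, extra = divmod(len(tasks), n_workers)
    start = (rank - 1) * base + min(rank - 1, extra)
    stop = start + base + (1 if rank - 1 < extra else 0)
    return tasks[start:stop]


# Example: 10 tasks shared among 3 workers (size=4 including rank 0)
print(worker_slice(list(range(10)), rank=1, size=4))  # [0, 1, 2, 3]
print(worker_slice(list(range(10)), rank=2, size=4))  # [4, 5, 6]
print(worker_slice(list(range(10)), rank=3, size=4))  # [7, 8, 9]
```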
One other advantage of this render engine is that it doesn't require the fork() semantics of Linux for efficient sharing of the scenegraph between processes. It'll therefore enable efficient parallel processing on systems which don't implement fork(), such as Windows. This is partly why I named it MPIEngine rather than ClusterEngine, as it's useful on more than just distributed memory systems.
This is branched off from master (v0.8.1) and doesn't have any of the refactor in the feature/ClusterEngine branch, as that was the quickest way to get something prototyped. I'm opening the PR for early feedback before worrying too much about bringing ClusterEngine (and development) up to date with the latest release and then adding this on top.