node_crunch

Mention in docs that it's not data parallel

Open droundy opened this issue 4 years ago • 5 comments

I think your comparison with MPI should at least mention one major difference: as far as I can tell, Node Crunch requires the server to store all the data. When I've used MPI, the main reason was that no single node had enough RAM to hold the entire computation. I think Node Crunch operates in a different problem space, where the data is small, right?

droundy, Jul 27 '21 12:07

Thanks for your feedback! In the mandel example all the data is stored on the server, that's true (this is how the array2d utility structure works). But it doesn't have to be like that: Node Crunch doesn't require any specific data storage, and it's up to the user how to code it. I will add a note to the README to make that clearer.

willi-kappler, Jul 27 '21 14:07

The statements about the computation continuing when nodes go down at least imply that data must be stored on the server, though, right? If the data is stored on the nodes, I can't see how else some nodes could go down without restarting the computation as a whole.

droundy, Jul 27 '21 16:07

Yes, that's true, the data has to be stored somewhere. But the whole computation does not have to be restarted, only the part on the node that went down. Ideally each node saves its data to disk or an NFS share at regular intervals. The server can then assign that part of the data to a new node, which reads the saved state from the NFS share (or some other network file system) and continues from there.
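The checkpoint-and-resume idea described here can be sketched with the Rust standard library alone. This is only an illustration under stated assumptions, not Node Crunch's API: `ChunkState`, `save_checkpoint`, `load_checkpoint`, `run_chunk`, and the plain-text file format are all hypothetical, and the checkpoint path is assumed to live on a file system that a replacement node can also read.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical per-chunk state: the next step to run and a running result.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkState {
    pub next_step: u64,
    pub partial_sum: u64,
}

/// Write the state to `path` as the text "next_step partial_sum".
pub fn save_checkpoint(path: &Path, state: &ChunkState) -> io::Result<()> {
    // Write to a temporary file first, then rename it into place, so a
    // crash in the middle of a write does not clobber the last good checkpoint.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, format!("{} {}", state.next_step, state.partial_sum))?;
    fs::rename(&tmp, path)
}

/// Load a checkpoint; return a fresh state if no checkpoint exists yet.
pub fn load_checkpoint(path: &Path) -> io::Result<ChunkState> {
    let text = match fs::read_to_string(path) {
        Ok(text) => text,
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            return Ok(ChunkState { next_step: 0, partial_sum: 0 });
        }
        Err(e) => return Err(e),
    };
    let bad = || io::Error::new(io::ErrorKind::InvalidData, "malformed checkpoint");
    let mut parts = text.split_whitespace();
    let next_step: u64 = parts
        .next()
        .and_then(|s| s.parse::<u64>().ok())
        .ok_or_else(bad)?;
    let partial_sum: u64 = parts
        .next()
        .and_then(|s| s.parse::<u64>().ok())
        .ok_or_else(bad)?;
    Ok(ChunkState { next_step, partial_sum })
}

/// Resume from the last checkpoint (if any) and run steps up to `stop_at`,
/// saving a checkpoint every `interval` steps. An interval of 0 disables saving.
pub fn run_chunk(path: &Path, stop_at: u64, interval: u64) -> io::Result<ChunkState> {
    let mut state = load_checkpoint(path)?;
    while state.next_step < stop_at {
        state.partial_sum += state.next_step; // stand-in for the real work
        state.next_step += 1;
        if interval != 0 && state.next_step % interval == 0 {
            save_checkpoint(path, &state)?;
        }
    }
    Ok(state)
}
```

If a node running `run_chunk` dies between checkpoints, a new node calling `run_chunk` with the same path repeats only the steps since the last saved checkpoint. Note that this sketch covers a chunk that does not depend on other nodes' in-progress state.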

willi-kappler, Jul 29 '21 14:07

Being able to restart one node without affecting the others would require that the nodes do not need to communicate in order to do their calculations. Is that a constraint of the design, that it only works with trivially parallelizable tasks? Checkpointing on every iteration would be prohibitively slow...

droundy, Jul 29 '21 16:07

Why would that require that the nodes do not communicate? If one node goes down, the server will notice because the heartbeat messages from that node stop arriving. Node Crunch does this automatically, and the user code just gets a notification so it can react if needed. And of course the checkpoints do not have to be done every iteration, only at specific intervals (defined by the user in a configuration file).
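As a rough illustration of heartbeat-based failure detection in general, the server can remember when it last heard from each node and treat a node as down once that timestamp is older than a timeout. The sketch below assumes nothing about Node Crunch's internals: `HeartbeatTracker`, its methods, and the `u32` node IDs are hypothetical names chosen for the example. The current time is passed in explicitly so that the behavior is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical server-side tracker; not Node Crunch's actual types.
pub struct HeartbeatTracker {
    timeout: Duration,
    last_seen: HashMap<u32, Instant>,
}

impl HeartbeatTracker {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    /// Record a heartbeat from `node_id` received at time `now`.
    pub fn record(&mut self, node_id: u32, now: Instant) {
        self.last_seen.insert(node_id, now);
    }

    /// Remove and return (sorted by ID) the nodes whose last heartbeat
    /// is older than the timeout, so their work can be reassigned.
    pub fn take_dead_nodes(&mut self, now: Instant) -> Vec<u32> {
        let timeout = self.timeout;
        let mut dead: Vec<u32> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > timeout)
            .map(|(&id, _)| id)
            .collect();
        dead.sort_unstable();
        for id in &dead {
            self.last_seen.remove(id);
        }
        dead
    }
}
```

A server loop would call `record` whenever a heartbeat arrives and `take_dead_nodes` periodically, then hand each dead node's chunk to another node.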

willi-kappler, Jul 29 '21 17:07