parsec icon indicating copy to clipboard operation
parsec copied to clipboard

Implement GET protocol for dependencies

Open devreal opened this issue 3 years ago • 8 comments

Avoids a round-trip by directly fetching data when a dependency release arrives. Adds runtime_comm_get MCA parameter to enable use of the GET protocol.

Currently enabled to have it worked by CI. I'm not sure I am using the right datatypes since the reshape and redistribute tests are failing...

Signed-off-by: Joseph Schuchart [email protected]

devreal avatar Aug 16 '22 21:08 devreal

Have you looked at performance at all? I think I've discussed with @bosilca ad nauseam regarding the implications regarding communication prioritization. I might pull a version of this into my branch at some point so I can test.

I'm not sure I am using the right datatypes since the reshape and redistribute tests are failing...

There could very well be a bug in the GET implementation for the MPI comm engine that has gone undiagnosed since it's not been well-tested.

omor1 avatar Aug 17 '22 00:08 omor1

I implemented this to have a better baseline in the comparison with TTG (which does GET instead of PUT). As far as I remember, there was little to no benefit in terms of performance (didn't get worse though).

devreal avatar Aug 17 '22 00:08 devreal

didn't get worse though

That's relevant! How much did you scale and were you using THREAD_MULTIPLE?

George's hypotheses regarding this are, if I recall correctly, along the lines that a sender can more easily regulate how much data it pushes onto the network than the receiver—with a GET protocol the sender doesn't have as much ability to prioritize communications, so multiple receivers can end up competing for a sender's bandwidth. On the other hand, a PUT protocol can overwhelm a receiver with many incoming messages, but that shouldn't be the case for most PaRSEC applications since the receiver also regulates which data it requests to be sent.

omor1 avatar Aug 17 '22 00:08 omor1

Sweet, it seems to have been a problem with the termination detection @therault :) All checks pass now

devreal avatar Sep 23 '22 16:09 devreal

Argh of course not, CI doesn't test the GET protocol. Test fail if run with PARSEC_MCA_runtime_comm_get=1. I still get MPI_ERR_TRUNCATE errors, so there is a problem with GET.

devreal avatar Sep 23 '22 17:09 devreal

Argh of course not, CI doesn't test the GET protocol. Test fail if run with PARSEC_MCA_runtime_comm_get=1. I still get MPI_ERR_TRUNCATE errors, so there is a problem with GET.

MPI_ERR_TRUNCATE should only occur if you send more than the matching receive expects right?

I think I might have found the bug. When a process gets too many internal GET AMs (or is asked to do too many PUTs), then it defers starting the MPI_Isend until later by pushing to mpi_funnelled_dynamic_req_fifo. When starting the send later, it expects the tag to be used to be in cb.cb_type.onesided.size—why there, I don't know. mpi_no_thread_put does this correctly, but mpi_funnelled_internal_get_am_callback seems to do things terribly wrong—I have no idea if it even sends to the correct destination in this case, or what data it's sending, since it's using entirely different fields in the cb_type union. Wow this seems broken.

Like I said, no one has tested GET really.

omor1 avatar Sep 23 '22 20:09 omor1

Sigh, thanks for checking @omor1. I'm fairly sure I had it working at some point before the big merge. The MPI backend is still student research quality, at best. I hate the fact that we're putting data into random fields, makes the code unmaintainable. I guess It's good to have pointer though, will have to take a closer look at it again...

devreal avatar Sep 23 '22 20:09 devreal

The MPI backend is still student research quality, at best.

"research quality"

The LCI backend is better, in my humble opinion, and is certainly better-documented. It still has a decent amount of jankiness from various things I tried and haven't fully ripped out, but is more maintainable.

omor1 avatar Sep 23 '22 20:09 omor1