charm icon indicating copy to clipboard operation
charm copied to clipboard

CkArray: Remove duplicate array message delivery path

Open epmikida opened this issue 3 years ago • 5 comments

There currently exist two delivery paths for array messages. One was intended as an optimization, but with the new refactor of array message delivery, it is no longer needed, and now only adds complication as the two paths had diverged.

epmikida avatar Jul 29 '21 21:07 epmikida

Still in progress, but I put it here to get input from @nitbhat since the primary thing it affects is some ZC stuff.

The main question I have is w.r.t. the zero copy message buffering. Is that something that has to happen per location, or per array element? The distinction only matters in cases where you have multiple arrays bound together. If a particular index has a buffered rget message, do all elements for that index need to buffer all incoming communication, or does just that specific array element? I also have some other questions that I'll inline directly in the code.

epmikida avatar Jul 29 '21 22:07 epmikida

Ok, so I can't comment on unchanged files, but this function is where most of my questions lie.

  1. Why is ckJustMigrated called here? And on all chares?
  2. The resume from sync stuff seems to imply that maybe this is in fact a property of the location not the element.
  3. Is there a reason to use CmiHandleMessage to go all the way back through the queue? My thought was to treat this message buffering like any other message buffering, and just call CkArray::recvMsg (or something similar) rather than going all the way back through Cmi message handling. But I wasn't sure if there was a specific reason for that.
void CkLocMgr::processAfterActiveRgetsCompleted(CmiUInt8 id)
{
  // Call ckJustMigrated
  CkLocRec* myLocRec = elementNrec(id);
  callMethod(myLocRec, &CkMigratable::ckJustMigrated);

  // Call ResumeFromSync on elements that were waiting for rgets
  auto iter2 = toBeResumeFromSynced.find(id);
  if (iter2 != toBeResumeFromSynced.end())
  {
    iter2->second->ResumeFromSync();
    toBeResumeFromSynced.erase(iter2);
  }

  // Deliver buffered messages to the elements that were waiting on rgets
  auto iter = bufferedActiveRgetMsgs.find(id);
  if (iter != bufferedActiveRgetMsgs.end())
  {
    std::vector<CkArrayMessage*> bufferedMsgs = iter->second;
    bufferedActiveRgetMsgs.erase(iter);
    for (auto msg : bufferedMsgs)
    {
      CmiHandleMessage(UsrToEnv(msg));
    }
  }
}

epmikida avatar Jul 29 '21 22:07 epmikida

Still in progress, but I put it here to get input from @nitbhat since the primary thing it affects is some ZC stuff.

The main question I have is w.r.t. the zero copy message buffering. Is that something that has to happen per location, or per array element? The distinction only matters in cases where you have multiple arrays bound together. If a particular index has a buffered rget message, do all elements for that index need to buffer all incoming communication, or does just that specific array element? I also have some other questions that I'll inline directly in the code.

Sorry for the delay in getting back @epmikida. The buffering happens per array element when the array element has an active pup based RDMA Get call in progress and it is not safe to execute entry methods on that array element. If there is a corresponding bound array element, an active rget in one doesn't impact the other element, so it seems like this should be entirely array element specific.

nitbhat avatar Oct 05 '21 20:10 nitbhat

  1. Why is ckJustMigrated called here? And on all charges?

Since ckJustMigrated is called at the end of migration, for chares migrated with ZC Pup, I don't call that method when there are active rgets inside CkLocMgr::immigrate.

3575   if (!zcRgetsActive)$
3576   {$
3577     // Let all the elements know we've arrived$
3578     callMethod(rec, &CkMigratable::ckJustMigrated);$
3579   }$
3580 $

For that reason, I need to call it after the rget has completed signifying the completion of migration.

  1. The resume from sync stuff seems to imply that maybe this is in fact a property of the location not the element.

I think I put this code in CkLogMgr because the older location management code was in here. But, in theory, a zc pup based rget is only array element dependent and the reason I buffer messages for that array element is that the element's data is not safe to be operated upon by executing entry methods until the rgets are complete. A bound array element bound to this element should not be impacted by active rgets.

  1. Is there a reason to use CmiHandleMessage to go all the way back through the queue? My thought was to treat this message buffering like any other message buffering, and just call CkArray::recvMsg (or something similar) rather than going all the way back through Cmi message handling. But I wasn't sure if there was a specific reason for that.

There is no reason to call CmiHandleMessage. I can simply call a higher-level hander function.

nitbhat avatar Oct 05 '21 20:10 nitbhat

  1. The resume from sync stuff seems to imply that maybe this is in fact a property of the location not the element.

I think I put this code in CkLogMgr because the older location management code was in here. But, in theory, a zc pup based rget is only array element dependent and the reason I buffer messages for that array element is that the element's data is not safe to be operated upon by executing entry methods until the rgets are complete. A bound array element bound to this element should not be impacted by active rgets.

I'm not sure about this. A defining quality of bound array elements is that they migrate together. I think in AMPI we rely on the ordering of packing/unpacking bound array elements being consistent, so that we can get pointers into data owned by the "parent" AMPI chare array when PUPing the "child" AMPI chare array. I think if we allow migration to finish in one array element independent of another bound array which has active rgets, then the first may end up trying to get a pointer to some data that hasn't yet been transferred via rget in the second.

stwhite91 avatar Oct 08 '21 16:10 stwhite91