arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Should we use chpl_serializeSlices?

Open reuster986 opened this issue 4 years ago • 7 comments

In the discussion of PR #618 @bradcray mentioned that adding -schpl_serializeSlices=true to the makefile would help the performance of slicing distributed arrays. He mentioned having brought it up in the past, but I can't remember the discussion or whether we decided anything, so I wanted to start a thread for it.

Some questions:

  • What exactly does this flag do?
  • How much would we expect it to help with performance? With scaling?
  • Are there any known issues or trade-offs with it?

reuster986 avatar Jan 12 '21 13:01 reuster986

Hi @reuster986 —

-schpl_serializeSlices=true enables some behavior that we intend to enable by default in the future, yet which we haven't felt ready to do yet (because in some cases it causes a communication tradeoff that isn't always beneficial, and with more effort, we believe we can effectively make it a "virtually always-win" optimization that deserves to be on by default; see the next paragraph for details). In short, today when we slice a distributed array, we do a bunch of proactive communication to all locales to let them know about the slice, whether or not they own a piece of it or ever will. The "serialize slice" optimization mode takes more of a lazy approach, creating the distributed array slice locally, and then only letting other locales know about it on a "need-to-know" basis (only the locales that need to know about it, and only when they need to know about it). This communication of the slice is rolled into an existing communication that's already happening for that locale, so adds some data to the payload, but doesn't induce additional communications (when it's working as intended). You can read more about the optimization starting at slide 15 here: https://chapel-lang.org/releaseNotes/1.20/03-arrays.pdf

The effect on performance / scaling isn't trivial to characterize. For the statement we were talking about in that PR, it reduced communication to a smaller overall number and to only involve the locales that needed to be, as designed. In other cases, we've seen it reduce the number of active messages, yet increase the number of gets, which may or may not be a net win, depending on whether the gets result in a network bottleneck. This is why we chose not to enable the optimization by default yet, and the remaining work to enable it involves removing the need for those gets. I've lost my context for this work to the point that I can't do a very good job of characterizing when this tradeoff occurs or whether it is likely to come up much in Arkouda. @e-kayrakli has been working in this area recently, and may be able to fill in those gaps for me.

Unless the answer is obvious or known to Engin: for Arkouda, I'd probably take the approach of running some benchmarks or operations you care about at scale with and without the flag to see whether it's a net win or loss.

bradcray avatar Jan 12 '21 17:01 bradcray

I've lost my context for this work to the point that I can't do a very good job of characterizing when this tradeoff occurs or whether it is likely to come up much in Arkouda. @e-kayrakli has been working in this area recently, and may be able to fill in those gaps for me.

I can elaborate a bit more here. When you throw in -schpl_serializeSlices=true we disable that proactive communication that Brad mentioned for slices, because we assume that we are going to serialize and forward the serial buffer and deserialize on the receiving end when you have an on statement that refers to a slice from the outer scope. That's a "win" scenario for -schpl_serializeSlices=true

However, if you zip over slices like:

forall (a,b) in zip(A[3..10]. B[3..10]) do foo();

we end up creating a tuple that stores references to the slice instances. And currently we don't know how to forward such tuples. (Or at the very least, we know how we can do it, but we are thinking of other approaches). So, when you have schpl_serializeSlices=true, we never do the proactive communication to let the locales know about A[3..10] and B[3..10]; and never actually serialize and send them to the other locales (because they are inside a tuple for which we don't do any serialization). This causes us to do GETs for the follower loop we write for that forall, because that follower loop is actually inside an on statement.

i.e. the above forall turns into something like:

ref ASlice = A[3..10];
ref BSlice = B[3..10];
for followThis in ASlice.leader {
  on someOtherLocaleDecidedByASliceLeader {
    for (a,b) in zip(ASlice.follower(followThis), BSlice.follower(followThis)) {  // ASlice and BSlice are remote here
      // create tasks that call foo
    }
  }
}

e-kayrakli avatar Jan 12 '21 18:01 e-kayrakli

Building on Engin's response, we believe that idioms that are semantically equivalent to zippering can also cause the increase in gets. For example:

C = A[3..10] + B[3..10];

can be thought of as being equivalent to:

C = [(a,b) in zip(A[3..10], B[3..10])] a + b;

so may also incur the get penalty.

bradcray avatar Jan 12 '21 18:01 bradcray

Belatedly thinking to tag @ronawho on here in case he wants to do any comparisons with vs. without the flag for the core Arkouda benchmarks.

bradcray avatar Jan 16 '21 02:01 bradcray

I'm happy to run some benchmarks, but I don't think any of the current core benchmarks will be impacted by this:

  • stream definitely shouldn't be
  • gather/scatter/argsort/coargsort should mostly be aggregation and shouldn't be impacted by this
  • scan/reduce are really just chapel scan/reduce so shouldn't be impacted either

ronawho avatar Jan 16 '21 17:01 ronawho

This reminds me, I should make a benchmark for concatenate, which would definitely probe the effect of this option.

reuster986 avatar Jan 19 '21 13:01 reuster986

We're looking to enable chpl_serializeSlices by default in https://github.com/chapel-lang/chapel/pull/18911. I would recommend just waiting for us to make the switch.

As part of us changing that default I'd like to check Arkouda performance. Assuming things look good, there are probably a couple places in Arkouda that we could clean up:

  • Should be able to use slice-assignment instead of lowLevelLocalizingSlice for in1d-assoc: https://github.com/Bears-R-Us/arkouda/blob/4064e27106261dd79219592b64ae5574909d9499/src/In1d.chpl#L36-L37
  • Should be able to use slice-assignment instead of aggregation for SipHash: https://github.com/Bears-R-Us/arkouda/pull/1060#discussion_r802663518

ronawho avatar Feb 09 '22 17:02 ronawho

@Bears-R-Us/arkouda-core-dev & @ronawho - It looks like we already have this implemented to some extent, Should this issue remain open?

Ethan-DeBandi99 avatar Dec 19 '22 19:12 Ethan-DeBandi99

I think we can close it. This issue is effectively asking if Arkouda should enable a currently experimental chapel flag, but I believe we decided to just wait for chapel to enable it by default.

ronawho avatar Dec 20 '22 15:12 ronawho