Polygeist

Add support for splitting ifs with barriers in parallel ops without interchanging them

Open ivanradanov opened this issue 2 years ago • 2 comments

This PR adds two new ways to handle ifs with barriers in parallel regions.

This is how it is currently done, by interchanging the if with the parallel op:

parallel {
  A()
  if {
    B()
    barrier
    C()
  }
  D()
}

->

parallel {
  A()
}
if {
  parallel {
    B()
  }
  parallel {
    C()
  }
}
parallel {
  D()
}

The first new way splits an if with a directly nested barrier at the barrier itself, so the if no longer has to be split off with barriers and interchanged with the parallel op:

parallel {
  A()
  if {
    B()
    barrier
    C()
  }
  D()
}

->

parallel {
  A()
  if {
    B()
  }
}
parallel {
  if {
    C()
  }
  D()
}

This should hopefully improve performance, since it keeps A and B in one parallel region and C and D in another.
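To illustrate the ordering that this split has to preserve, here is a minimal Python sketch (not Polygeist code). It models a parallel region as a loop over thread ids that runs to completion before the next region starts. The names `run_ifsplit`, `barrier_respected`, and `cond` are hypothetical, and the sketch assumes the if condition can be re-evaluated unchanged in the second region.

```python
def run_ifsplit(num_threads, cond):
    """Execute the split form and return the ordered trace of (stmt, tid)."""
    trace = []

    # First parallel region: A, then B under the if.
    for tid in range(num_threads):
        trace.append(("A", tid))
        if cond(tid):
            trace.append(("B", tid))

    # The end of the first region plays the role of the barrier.

    # Second parallel region: C under the same if, then D.
    for tid in range(num_threads):
        if cond(tid):
            trace.append(("C", tid))
        trace.append(("D", tid))

    return trace


def barrier_respected(trace):
    """True if every B appears in the trace before every C."""
    b_pos = [i for i, (stmt, _) in enumerate(trace) if stmt == "B"]
    c_pos = [i for i, (stmt, _) in enumerate(trace) if stmt == "C"]
    return not b_pos or not c_pos or max(b_pos) < min(c_pos)


trace = run_ifsplit(4, cond=lambda tid: True)
print(barrier_respected(trace))
```

In this model the region boundary stands in for the barrier, so no thread reaches C before all threads have finished B.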

The second new way hoists the if out of the parallel op and joins the appropriate blocks for each of the two cases, where the if condition evaluates to true or to false:

parallel {
  A()
  if {
    B()
    barrier
    C()
  }
  D()
}

->

if {
  parallel {
    A()
    B()
  }
  parallel {
    C()
    D()
  }
} else {
  parallel {
    A()
    D()
  }
}

This allows us to get rid of the branch inside the parallel op at the cost of increased code size. With this second way the code size grows exponentially with the number of barriers, so it will probably only be of limited use, guided by heuristics (not yet implemented) that decide when to apply it.
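A rough Python sketch of that growth, under an assumed model that is not taken from the PR: a single parallel region containing k consecutive ifs, each with one barrier, where every condition can be hoisted. The statement names `S0..Sk`, `Bi`, `Ci` and the helpers `hoisted_variants` and `statement_copies` are hypothetical.

```python
from itertools import product


def hoisted_variants(num_ifs):
    """Return one list of parallel regions per combination of if outcomes.

    Assumed input shape:
      parallel { S0; if c1 { B1; barrier; C1 }; S1; ...; if ck { ... }; Sk }
    """
    variants = {}
    for outcome in product([True, False], repeat=num_ifs):
        regions = [["S0"]]
        for i, taken in enumerate(outcome, start=1):
            if taken:
                regions[-1].append(f"B{i}")  # code before the barrier
                regions.append([f"C{i}"])    # the barrier starts a new region
            regions[-1].append(f"S{i}")      # code after the if
        variants[outcome] = regions
    return variants


def statement_copies(num_ifs):
    """Total number of statement copies across all hoisted variants."""
    return sum(
        len(region)
        for regions in hoisted_variants(num_ifs).values()
        for region in regions
    )


for k in range(1, 5):
    # number of ifs, number of hoisted branches, total statement copies
    print(k, len(hoisted_variants(k)), statement_copies(k))
```

For a single if, the two variants match the transformation above with A as `S0` and D as `S1`. Each additional if doubles the number of branches in this model.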

ivanradanov, Aug 04 '22 07:08

Can this not alternatively become the following, avoiding code duplication?

parallel {
  A()
  if {
    B()
  }
}
parallel {
  if {
    C()
  }
  D()
}

wsmoses, Aug 11 '22 20:08

One can choose between

parallel {
  A()
  if {
    B()
  }
}
parallel {
  if {
    C()
  }
  D()
}

and

if {
  parallel {
    A()
    B()
  }
  parallel {
    C()
    D()
  }
} else {
  parallel {
    A()
    D()
  }
}

by specifying --cpuify="distribute.ifsplit" or --cpuify="distribute.ifhoist" respectively.

(The default is still the original transformation.)
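As a sanity check that the two forms compute the same thing, here is a small Python sketch (not Polygeist code). It assumes the if condition is the same for every thread, which the hoisted form needs because its if sits outside the parallel op. The statement bodies, the `state` layout, and all function names are hypothetical; C reads a neighbouring thread's `y`, so it depends on the barrier.

```python
def fresh_state(n):
    return {"x": [0] * n, "y": [0] * n, "z": [0] * n, "out": [0] * n}


def stmt_a(s, tid):
    s["x"][tid] = tid + 1


def stmt_b(s, tid):
    s["y"][tid] = 2 * s["x"][tid]


def stmt_c(s, tid):
    n = len(s["y"])
    s["z"][tid] = s["y"][(tid + 1) % n]  # reads another thread's y


def stmt_d(s, tid):
    s["out"][tid] = s["x"][tid] + s["z"][tid]


def parallel(state, n, body):
    """Model a parallel region: run body for every thread, then return."""
    for tid in range(n):
        body(state, tid)


def run_ifsplit(n, cond):
    s = fresh_state(n)
    parallel(s, n, lambda s, t: (stmt_a(s, t), cond and stmt_b(s, t)))
    parallel(s, n, lambda s, t: (cond and stmt_c(s, t), stmt_d(s, t)))
    return s


def run_ifhoist(n, cond):
    s = fresh_state(n)
    if cond:
        parallel(s, n, lambda s, t: (stmt_a(s, t), stmt_b(s, t)))
        parallel(s, n, lambda s, t: (stmt_c(s, t), stmt_d(s, t)))
    else:
        parallel(s, n, lambda s, t: (stmt_a(s, t), stmt_d(s, t)))
    return s


for cond in (True, False):
    print(cond, run_ifsplit(4, cond) == run_ifhoist(4, cond))
```

The split form keeps one copy of each statement and re-tests the condition per thread, while the hoisted form tests it once and duplicates A and D.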

Both of the new ways result in close to no overall performance difference on all of Rodinia combined. Individual benchmark speedups seemingly range from -7% to +4% for ifsplit and from -2% to +2% for ifhoist compared to the current transformation (some of it could be attributed to randomness).

ivanradanov, Aug 12 '22 00:08