
Concurrent normalization

Open alex-mckenna opened this issue 3 years ago • 15 comments

Still TODO:

  • [x] Write a changelog entry (see changelog/README.md)
  • [x] Check copyright notices are up to date in edited files
  • [ ] Make sure the flag to turn it on/off works

alex-mckenna avatar Feb 03 '22 14:02 alex-mckenna

@vmchale I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue? Have you got any numbers for us? :)

martijnbastiaan avatar Mar 22 '22 09:03 martijnbastiaan

I recall Alex mentioning he couldn't reliably measure speedups, is that now fixed by using a work queue?

Part of that was a lack of programs to try it on that have a significant branching factor when normalizing top-level binders. For the benchmarks / examples we have, I was finding that not much was actually running at the same time (and since so many entities are also small, concurrency doesn't help much there). In many of our examples the number of children for a binder is typically 1, which in the worst case means no concurrency at all. When there are multiple children, the count typically doesn't go higher than 2 or 3 anywhere in a design, so the actual concurrency you observe is somewhat limited.

The point is I don't think this can really be answered without a benchmark program where normalizing a binder can result in many child binders needing to be normalized as a result (i.e. an entity to normalize where it's pretty much guaranteed it will be able to find enough work for all cores on a machine). I think ideally this would be two benchmarks:

  • one where every child in the tree of things to normalize is distinct. This lets us see just the effect of concurrency without the work queue being able to share any normalization results
  • one where children can be repeated in the tree. This lets us see the effect of the work queue in further reducing the time needed to normalize the entire design
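To make the second point concrete, here is a minimal sketch (all names hypothetical — this is not the actual Clash code) of the result-sharing idea behind the work queue: each binder name gets one result slot, so a binder that appears as a child of several parents is only normalized once, and later requests reuse (or block on) the shared result.

```haskell
import Control.Concurrent.MVar
import Data.IORef

-- One result slot per binder name. The first caller to claim a name computes
-- the result; every later caller for the same name reuses the cached slot.
type Cache = MVar [(String, MVar Int)]

normalizeShared :: IORef Int -> Cache -> (String -> Int) -> String -> IO Int
normalizeShared counter cache normalize name = do
  (slot, mine) <- modifyMVar cache $ \entries ->
    case lookup name entries of
      Just slot -> pure (entries, (slot, False))   -- someone already claimed it
      Nothing   -> do
        slot <- newEmptyMVar                        -- claim it ourselves
        pure ((name, slot) : entries, (slot, True))
  if mine
    then do
      modifyIORef' counter (+ 1)   -- count how many normalizations really ran
      let r = normalize name       -- stand-in for the expensive normalization
      putMVar slot r
      pure r
    else readMVar slot             -- wait for / reuse the shared result

-- Demo: "top" and "child" each occur twice, but are normalized once each.
demo :: IO ([Int], Int)
demo = do
  counter <- newIORef 0
  cache   <- newMVar []
  rs <- mapM (normalizeShared counter cache length)
             ["top", "child", "child", "top"]
  n  <- readIORef counter
  pure (rs, n)

main :: IO ()
main = demo >>= print
```

In a genuinely concurrent setting the `readMVar` would block until the worker that claimed the slot fills it, so repeated children in the tree cost one normalization total — which is exactly what the second benchmark shape above would exercise.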

alex-mckenna avatar Mar 22 '22 09:03 alex-mckenna

@martijnbastiaan sort of? It runs faster instead of slower, but not always drastically better.

Let me gather more data.

vmchale avatar Mar 22 '22 13:03 vmchale

[comparison graph]

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

This is just the `clash-benchmark-normalization` we have, which isn't particularly parallel as far as I can tell (?)

vmchale avatar Mar 22 '22 17:03 vmchale

Is there a way to easily disable concurrent normalization?

For easy debugging you would likely have to compile without `-threaded`. This is already somewhat of an implicit recommendation for when you need to use gdb with Haskell.

alex-mckenna avatar Mar 22 '22 17:03 alex-mckenna

One thing we should probably do is make sure we record transformation history sensibly though. Actually debugging the normalization process becomes somewhat more hectic if rewrites are seen (e.g. through the medium of `-fclash-debug-info AppliedTerm`) in the order they are performed.

Perhaps that means waiting until an entity is normalized, and only then printing out the rewrite steps it took (with locked access to stdout). That way, even if entities aren't printed in the order they're normalized, you still see all their rewrite steps together without steps from other entities interspersed.
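As a rough sketch of that idea (not the actual Clash implementation — function names and the step format are made up here): buffer an entity's whole rewrite history as one string, then emit it in a single write while holding a lock, so histories from concurrently normalized entities never interleave.

```haskell
import Control.Concurrent.MVar

-- Render an entity's complete rewrite history as one block of text.
formatHistory :: String -> [String] -> String
formatHistory entity steps =
  unlines ((entity ++ ":") : map ("  " ++) steps)

-- Print the buffered history atomically: only one entity's block can be
-- written at a time, so steps from other entities can't be interspersed.
printHistory :: MVar () -> String -> [String] -> IO ()
printHistory lock entity steps =
  withMVar lock (\_ -> putStr (formatHistory entity steps))

main :: IO ()
main = do
  lock <- newMVar ()
  printHistory lock "topEntity" ["AppliedTerm f", "AppliedTerm g"]
```

Each worker thread would call `printHistory` once per finished entity; the `MVar ()` acts as the stdout lock mentioned above.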

alex-mckenna avatar Mar 22 '22 18:03 alex-mckenna

Ok so: the work queue means that concurrent normalization isn't pathologically slow, but it isn't particularly impressive either.

Okay, that's good - hopefully it means our examples are just not benefiting from concurrent normalization. An example of a more realistic codebase would be Christiaan's Contranomy; perhaps try there?

(GitHub please copy threaded conversation from GitLab next pls.)

martijnbastiaan avatar Mar 23 '22 08:03 martijnbastiaan

Other "larger" public code bases I can think of are:

  1. https://github.com/cbiffle/cfm (I started a port to Clash 1.4 over here https://github.com/cbiffle/cfm/pull/2, perhaps port over to Clash 1.6 first)
  2. https://github.com/gergoerdi/clash-spaceinvaders
  3. https://github.com/gergoerdi/clash-compucolor2

christiaanb avatar Mar 23 '22 08:03 christiaanb

cc @christiaanb

Unfortunately it seems to be a little slower on Contranomy too!

Not concurrent:

time                 7.020 s    (6.929 s .. 7.067 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.929 s    (6.882 s .. 6.960 s)
std dev              48.11 ms   (31.76 ms .. 60.14 ms)

Concurrent:

time                 8.506 s    (8.338 s .. 8.615 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 8.438 s    (8.375 s .. 8.479 s)
std dev              61.79 ms   (24.56 ms .. 84.55 ms)

So you can see it's slower but nothing crazy (with work queues)

vmchale avatar Mar 23 '22 14:03 vmchale

Yeah, this is fairly similar to my experiences before Vanessa's improvements, i.e. the work queue and non-parallel GC (which I guess doesn't matter here anyway). At least Contranomy is large enough that the std. dev. doesn't get ridiculous.

alex-mckenna avatar Mar 23 '22 15:03 alex-mckenna

It may be that what we want to do is add a new flag, `-fclash-concurrent-normalization`, which turns on concurrent normalization only; concurrent top entity compilation remains always on. We document this flag with the caveat that it's only worth enabling for designs where normalization is sufficiently wide that cores have enough to do.

alex-mckenna avatar Mar 23 '22 15:03 alex-mckenna

Unfortunately cfm seems to require work, and in any case we see slowdowns in the majority of our little examples, so I think @alex-mckenna's suggestion is the way to go.

vmchale avatar Mar 23 '22 16:03 vmchale

To save you looking around for everything, adding a flag is basically:

  • add the new configuration option to ClashOpts
  • add the new option to the NFData, Eq and Hashable instances, and the defClashOpts further down in the same file
  • Define the new flag in ClashFlags.hs
  • Update the documentation for compiler flags

Then you can use the new flag by getting the ClashEnv from the monad and looking at its ClashOpts inside.
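The shape of the first two steps can be sketched like this (heavily simplified and hypothetical — the real `ClashOpts` record has many more fields, and the field name here is illustrative, not the actual one in the Clash sources):

```haskell
-- Hypothetical, trimmed-down version of the options record: a new Bool
-- field for the flag, defaulted off in defClashOpts per this thread.
data ClashOpts = ClashOpts
  { opt_concurrentNormalization :: Bool
    -- ... the many existing options are elided here ...
  } deriving (Eq, Show)

defClashOpts :: ClashOpts
defClashOpts = ClashOpts
  { opt_concurrentNormalization = False  -- -fclash-concurrent-normalization off by default
  }

main :: IO ()
main = print (opt_concurrentNormalization defClashOpts)
```

The remaining steps (extending the `NFData` / `Hashable` instances, wiring the flag up in ClashFlags.hs, and documenting it) follow the same pattern as the existing options in those files.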

alex-mckenna avatar Mar 24 '22 13:03 alex-mckenna

Oh wow, thank you!!

vmchale avatar Mar 24 '22 14:03 vmchale

So a couple of things:

  1. We haven't been able to verify any (meaningful) speedups on public codebases. I'll go around and ask some commercial partners to see if they observe any speedups. If not, I don't think we should merge this PR given that it adds quite a bit of complexity.
  2. We should finish the flag work. Thanks @leonschoorl.
  3. We should clean up the commit history.

No use in doing (2) and (3) before (1) though.

martijnbastiaan avatar Apr 07 '22 12:04 martijnbastiaan

We've only been able to measure slight slowdowns, and never speedups. Let's keep the branch around, but I doubt we're going to merge this as is.

martijnbastiaan avatar Mar 15 '23 08:03 martijnbastiaan