Benchmark Channels Last
Channels-last already has an API: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html, so it may be as simple as calling model.to(memory_format=torch.channels_last) and making sure the same conversion happens to the inputs.
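For reference, a minimal sketch of what that could look like (the ResNet model here is just an illustrative stand-in, not anything from the suite):

```python
import torch
import torchvision.models as models  # illustrative model choice, not part of the suite

# Convert both the model's weights and the example input to channels-last;
# forgetting either one means the convolutions fall back to NCHW kernels.
model = models.resnet50().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)

out = model(x)  # NHWC-friendly kernels are used where the backend supports them
```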
@jamesr66a I want to clarify the value prop of doing this at the benchmark infra level. The downside is the runtime cost of collecting 2x the measurements and sifting through 2x the data. The upside is that you get some new signal that's potentially useful. Another (hidden) downside may be that we miss the chance to incorporate channels-last as an optimization we apply automatically in our compiler. (Granted, we don't have a very full story for using compiler techniques on training benchmarks, so that's a gap right now.)
I'm conflicted on adding this for the above reason. Thoughts?
I don't really see how benchmarking channels-last as an explicit API precludes us from doing automatic optimizations in our compiler. On the contrary, it can help expose gaps and create development targets for doing so.
Additionally, I don't think we should be siloing "The Compiler™" vs. PyTorch as a whole product. One of the goals of PyTorch is to ensure that the user can get the performance they want. Whether that comes from The Compiler™ is a detail.
Yeah, I shouldn't have said 'the compiler'. What I meant by that is that we want to deliver perf enhancements to users without them having to change their code. We also sometimes want to deliver perf enhancements that do require them to change their code. Whether 'the compiler' or something else delivers the former is inconsequential.
But for something like channels last, a user could change their code to enable it today. In setting up the suite, we explicitly didn't go and hand-optimize the model code; instead, we used the models as they were in the wild. This is a proxy for striking a balance between what totally naive users might do and what our most advanced perf guides recommend.
So with that framing, do you argue for
- (1) changing a few models (that benefit most) to channels last, and having each model only use one layout
- (2) leaving the layouts alone in the model code, but looking for ways under the hood to use a 'better' layout
- (3) benchmarking both layouts for every model, so we aren't really assuming one way or the other but gathering data for both? I'm not sure if we'd average the results of both into the 'score' or just use the default layout for the score and treat the other as an auxiliary signal. (A rough sketch of what (3) could look like is below.)
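For concreteness, here is a minimal sketch of option (3), assuming a hypothetical helper that times one model/input pair per memory format; none of these names come from the existing harness:

```python
import time
import torch

def time_in_format(model, example_input, memory_format, iters=20, warmup=5):
    # Hypothetical helper: convert model and input to the requested layout, then time it.
    model = model.to(memory_format=memory_format)
    example_input = example_input.to(memory_format=memory_format)
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Option (3): collect both measurements per model and report them side by side,
# e.g. using the default layout for the score and channels-last as auxiliary signal.
# results = {
#     "contiguous": time_in_format(model, x, torch.contiguous_format),
#     "channels_last": time_in_format(model, x, torch.channels_last),
# }
```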
So my thinking comes from the fact that the API-facing layout (NCHW) was, iiuc, actually an arbitrary choice: it's simply what cuDNN did at the time. However, when thinking about how to best implement these operations on a concrete machine today, many (most?) machines prefer NHWC (including GPUs, ironically). I'm basically just pointing out that testing a lower-performance case due to historical baggage isn't ideal.
I think there's a separate conversation to be had about how aggressively we should make this optimization automatic, but I think that's orthogonal to whether we benchmark these things or not.
Let me think about which of the options would fit best with my thinking here.