DeepSpeed
Profiler enhancements: adding memory details to profiler, adding dtype breakdown for memory and params
Most of the changes in this PR are opinionated, with the main goal of starting a discussion. I'm happy to address feedback and update the PR and documentation accordingly.
Here's an overview:
- adding Memory details calculated from the number of parameters and their dtype size (as opposed to total model memory)
- adding a dtype breakdown for both the Memory and Params sections, showing the module's dtype composition (this could, for example, be controlled by a new configuration option; not implemented in this PR, and a subject for discussion)
- consolidating the repetitive unit conversion into a single method, `number_to_string`, which also fixes minor formatting artifacts such as empty trailing commas left by missing optional sections, trailing zero decimals, etc. (a rough sketch follows below this list)
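For illustration, here is a minimal sketch of how the consolidated conversion helper and the parameter-based memory estimate could look; the names, unit handling, and rounding below are simplified assumptions for this example, not the exact code in the PR:

```python
# Hypothetical sketch for discussion, not the actual code in this PR.

def number_to_string(num, units="", precision=2):
    """Scale a raw number to a K/M/G/T suffix and drop trailing zero
    decimals, e.g. 475100000 -> '475.1 M', 0 -> '0'."""
    factors = [(1e12, "T"), (1e9, "G"), (1e6, "M"), (1e3, "K")]
    for factor, suffix in factors:
        if abs(num) >= factor:
            value = num / factor
            break
    else:
        value, suffix = num, ""
    # '950.20' -> '950.2', '100.00' -> '100'
    text = f"{value:.{precision}f}".rstrip("0").rstrip(".")
    return f"{text} {suffix}{units}".strip()


def params_to_memory_string(num_params, bytes_per_param=2):
    """Estimate parameter memory as num_params * element size
    (2 bytes per parameter for bf16/fp16)."""
    return number_to_string(num_params * bytes_per_param, units="B")


print(number_to_string(475.1e6))         # 475.1 M
print(params_to_memory_string(475.1e6))  # 950.2 MB
```

With these changes, the profiler output for MegatronBertModel looks as follows (truncated):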
```
MegatronBertModel(
475.1 M = 475.1 M bfp16 = 100% Params, 303.15 GMACs = 100% MACs, 950.2 MB = 950.2 MB bfp16 = 100% Memory, 72.83 ms = 100% latency, 8.33 TFLOPS
(model): BertModel(
470.27 M = 470.27 M bfp16 = 98.98% Params, 178.07 GMACs = 58.74% MACs, 940.53 MB = 940.53 MB bfp16 = 98.98% Memory, 42.57 ms = 58.45% latency, 8.37 TFLOPS
(language_model): TransformerLanguageModel(
470.27 M = 470.27 M bfp16 = 98.98% Params, 178.07 GMACs = 58.74% MACs, 940.53 MB = 940.53 MB bfp16 = 98.98% Memory, 42.22 ms = 57.97% latency, 8.44 TFLOPS
(embedding): Embedding(
387.91 M = 387.91 M bfp16 = 81.65% Params, 0 MACs = 0% MACs, 775.81 MB = 775.81 MB bfp16 = 81.65% Memory, 816.58 μs = 1.12% latency, 0 FLOPS
(word_embeddings): VocabParallelEmbedding(387.13 M = 387.13 M bfp16 = 81.48% Params, 0 MACs = 0% MACs, 774.26 MB = 774.26 MB bfp16 = 81.48% Memory, 204.56 μs = 0.28% latency, 0 FLOPS)
(position_embeddings): Embedding(774.14 K = 774.14 K bfp16 = 0.16% Params, 0 MACs = 0% MACs, 1.55 MB = 1.55 MB bfp16 = 0.16% Memory, 188.35 μs = 0.26% latency, 0 FLOPS, 512, 1512)
(embedding_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 123.98 μs = 0.17% latency, 0 FLOPS, p=0.1, inplace=False)
)
(encoder): ParallelTransformer(
82.36 M = 82.36 M bfp16 = 17.34% Params, 178.07 GMACs = 58.74% MACs, 164.72 MB = 164.72 MB bfp16 = 17.34% Memory, 41.3 ms = 56.71% latency, 8.62 TFLOPS
(layers): ModuleList(
82.36 M = 82.36 M bfp16 = 17.34% Params, 178.07 GMACs = 58.74% MACs, 164.72 MB = 164.72 MB bfp16 = 17.34% Memory, 40.97 ms = 56.26% latency, 8.69 TFLOPS
(0): ParallelTransformerLayer(
27.45 M = 27.45 M bfp16 = 5.78% Params, 59.36 GMACs = 19.58% MACs, 54.91 MB = 54.91 MB bfp16 = 5.78% Memory, 14.03 ms = 19.26% latency, 8.47 TFLOPS
(input_layernorm): MixedFusedLayerNorm(3.02 K = 3.02 K bfp16 = 0% Params, 0 MACs = 0% MACs, 6.05 KB = 6.05 KB bfp16 = 0% Memory, 122.31 μs = 0.17% latency, 0 FLOPS)
(attention): ParallelAttention(
9.15 M = 9.15 M bfp16 = 1.93% Params, 21.9 GMACs = 7.22% MACs, 18.3 MB = 18.3 MB bfp16 = 1.93% Memory, 5.98 ms = 8.21% latency, 7.33 TFLOPS
(query_key_value): ColumnParallelLinear(6.86 M = 6.86 M bfp16 = 1.44% Params, 14.05 GMACs = 4.63% MACs, 13.73 MB = 13.73 MB bfp16 = 1.44% Memory, 2.89 ms = 3.96% latency, 9.74 TFLOPS)
(scale_mask_softmax): FusedScaleMaskSoftmax(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 379.8 μs = 0.52% latency, 38.65 GFLOPS)
(attention_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 187.87 μs = 0.26% latency, 0 FLOPS, p=0.1, inplace=False)
(dense): RowParallelLinear(2.29 M = 2.29 M bfp16 = 0.48% Params, 4.68 GMACs = 1.54% MACs, 4.58 MB = 4.58 MB bfp16 = 0.48% Memory, 1.02 ms = 1.4% latency, 9.2 TFLOPS)
)
(post_attention_layernorm): MixedFusedLayerNorm(3.02 K = 3.02 K bfp16 = 0% Params, 0 MACs = 0% MACs, 6.05 KB = 6.05 KB bfp16 = 0% Memory, 105.14 μs = 0.14% latency, 0 FLOPS)
(mlp): ParallelMLP(
18.3 M = 18.3 M bfp16 = 3.85% Params, 37.46 GMACs = 12.36% MACs, 36.59 MB = 36.59 MB bfp16 = 3.85% Memory, 7.31 ms = 10.04% latency, 10.24 TFLOPS
(dense_h_to_4h): ColumnParallelLinear(9.15 M = 9.15 M bfp16 = 1.93% Params, 18.73 GMACs = 6.18% MACs, 18.3 MB = 18.3 MB bfp16 = 1.93% Memory, 3.51 ms = 4.82% latency, 10.68 TFLOPS)
(dense_4h_to_h): RowParallelLinear(9.15 M = 9.15 M bfp16 = 1.93% Params, 18.73 GMACs = 6.18% MACs, 18.29 MB = 18.29 MB bfp16 = 1.93% Memory, 3.6 ms = 4.94% latency, 10.42 TFLOPS)
)
)
...
```
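To read a line of the output above: each module reports its parameter count (with the dtype breakdown) and share of total Params, its MACs and share of total MACs, the estimated Memory (again with the dtype breakdown) and its share, its latency and share of total latency, and the achieved FLOPS. As a rough back-of-the-envelope sanity check on the (query_key_value) line: 6.86 M bf16 parameters × 2 bytes ≈ 13.73 MB of memory, and 2 FLOPs per MAC × 14.05 GMACs / 2.89 ms ≈ 9.7 TFLOPS, both matching the reported values.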
Hi @cli99, can you please have a look and let me know what you think?
Sorry to bug you, @cli99, but is the list of reviewers for this file accurate, or should I reach out to someone else? Thanks!
@clumsy, sorry for the late response; I was out of the office for a while. Can you please resolve the conflicts with the main branch? Thanks.
Sure @cli99, are you in favor of the proposed changes though? Now I also need to find a way to make it reflect the expert parameters.
Most changes look good to me. The "dtype breakdown for both Memory and Params sections" and the corresponding fields in the expr are a bit unclear to me. Can you provide a description of an example module expr and explain when this would be useful?
Thanks for checking the PR, @cli99. The dtype breakdown basically shows the type of the underlying parameters, which is especially useful when it is fixed and not controlled by the DeepSpeed config. But I agree it's perhaps too much, especially since we would need to keep track of both dense and expert parameters. I'll remove that part and keep the overall/per-layer memory, then.
After some deliberation I've decided to separate the formatting fixes from the memory details (which I believe are not that useful after all, since you can roughly multiply num_params by the fp32/fp16/bf16 element size). I'll cut a separate PR for this.
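For example, 475.1 M bf16 parameters × 2 bytes per parameter ≈ 950.2 MB, which is exactly the Memory figure already shown in the output above.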