DeepSpeed
Profiler enhancements: adding memory details to profiler, adding dtype breakdown for memory and params
Most of the changes in this PR are opinionated, with the main goal of starting a discussion. I'm happy to address feedback and update the PR and documentation accordingly.
Here's an overview:
- adding Memory details calculated from the number of parameters and their dtype size (as opposed to total model memory)
- adding a dtype breakdown for both the Memory and Params sections, showing the module's dtype composition (this could, for example, be controlled by a new configuration option; not implemented in this PR, and a subject for discussion)
- consolidating the repetitive unit conversion into a single method, `number_to_string`, which also fixes minor formatting artifacts such as empty trailing commas left by missing optional sections, trailing zero decimals, etc. (a rough sketch follows below this list)
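For illustration, here is a minimal sketch of how the consolidated conversion helper and the parameter-based memory estimate could look; the names, unit handling, and rounding below are simplified assumptions for this example, not the exact code in the PR:

```python
# Hypothetical sketch for discussion, not the actual code in this PR.

def number_to_string(num, units="", precision=2):
    """Scale a raw number to a K/M/G/T suffix and drop trailing zero
    decimals, e.g. 475100000 -> '475.1 M', 0 -> '0'."""
    factors = [(1e12, "T"), (1e9, "G"), (1e6, "M"), (1e3, "K")]
    for factor, suffix in factors:
        if abs(num) >= factor:
            value = num / factor
            break
    else:
        value, suffix = num, ""
    # '950.20' -> '950.2', '100.00' -> '100'
    text = f"{value:.{precision}f}".rstrip("0").rstrip(".")
    return f"{text} {suffix}{units}".strip()


def params_to_memory_string(num_params, bytes_per_param=2):
    """Estimate parameter memory as num_params * element size
    (2 bytes per parameter for bf16/fp16)."""
    return number_to_string(num_params * bytes_per_param, units="B")


print(number_to_string(475.1e6))         # 475.1 M
print(params_to_memory_string(475.1e6))  # 950.2 MB
```

With these changes, the profiler output for MegatronBertModel looks as follows (truncated):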
```
MegatronBertModel(
475.1 M = 475.1 M bfp16 = 100% Params, 303.15 GMACs = 100% MACs, 950.2 MB = 950.2 MB bfp16 = 100% Memory, 72.83 ms = 100% latency, 8.33 TFLOPS
(model): BertModel(
470.27 M = 470.27 M bfp16 = 98.98% Params, 178.07 GMACs = 58.74% MACs, 940.53 MB = 940.53 MB bfp16 = 98.98% Memory, 42.57 ms = 58.45% latency, 8.37 TFLOPS
(language_model): TransformerLanguageModel(
470.27 M = 470.27 M bfp16 = 98.98% Params, 178.07 GMACs = 58.74% MACs, 940.53 MB = 940.53 MB bfp16 = 98.98% Memory, 42.22 ms = 57.97% latency, 8.44 TFLOPS
(embedding): Embedding(
387.91 M = 387.91 M bfp16 = 81.65% Params, 0 MACs = 0% MACs, 775.81 MB = 775.81 MB bfp16 = 81.65% Memory, 816.58 μs = 1.12% latency, 0 FLOPS
(word_embeddings): VocabParallelEmbedding(387.13 M = 387.13 M bfp16 = 81.48% Params, 0 MACs = 0% MACs, 774.26 MB = 774.26 MB bfp16 = 81.48% Memory, 204.56 μs = 0.28% latency, 0 FLOPS)
(position_embeddings): Embedding(774.14 K = 774.14 K bfp16 = 0.16% Params, 0 MACs = 0% MACs, 1.55 MB = 1.55 MB bfp16 = 0.16% Memory, 188.35 μs = 0.26% latency, 0 FLOPS, 512, 1512)
(embedding_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 123.98 μs = 0.17% latency, 0 FLOPS, p=0.1, inplace=False)
)
(encoder): ParallelTransformer(
82.36 M = 82.36 M bfp16 = 17.34% Params, 178.07 GMACs = 58.74% MACs, 164.72 MB = 164.72 MB bfp16 = 17.34% Memory, 41.3 ms = 56.71% latency, 8.62 TFLOPS
(layers): ModuleList(
82.36 M = 82.36 M bfp16 = 17.34% Params, 178.07 GMACs = 58.74% MACs, 164.72 MB = 164.72 MB bfp16 = 17.34% Memory, 40.97 ms = 56.26% latency, 8.69 TFLOPS
(0): ParallelTransformerLayer(
27.45 M = 27.45 M bfp16 = 5.78% Params, 59.36 GMACs = 19.58% MACs, 54.91 MB = 54.91 MB bfp16 = 5.78% Memory, 14.03 ms = 19.26% latency, 8.47 TFLOPS
(input_layernorm): MixedFusedLayerNorm(3.02 K = 3.02 K bfp16 = 0% Params, 0 MACs = 0% MACs, 6.05 KB = 6.05 KB bfp16 = 0% Memory, 122.31 μs = 0.17% latency, 0 FLOPS)
(attention): ParallelAttention(
9.15 M = 9.15 M bfp16 = 1.93% Params, 21.9 GMACs = 7.22% MACs, 18.3 MB = 18.3 MB bfp16 = 1.93% Memory, 5.98 ms = 8.21% latency, 7.33 TFLOPS
(query_key_value): ColumnParallelLinear(6.86 M = 6.86 M bfp16 = 1.44% Params, 14.05 GMACs = 4.63% MACs, 13.73 MB = 13.73 MB bfp16 = 1.44% Memory, 2.89 ms = 3.96% latency, 9.74 TFLOPS)
(scale_mask_softmax): FusedScaleMaskSoftmax(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 379.8 μs = 0.52% latency, 38.65 GFLOPS)
(attention_dropout): Dropout(0 = 0% Params, 0 MACs = 0% MACs, 0 B = 0% Memory, 187.87 μs = 0.26% latency, 0 FLOPS, p=0.1, inplace=False)
(dense): RowParallelLinear(2.29 M = 2.29 M bfp16 = 0.48% Params, 4.68 GMACs = 1.54% MACs, 4.58 MB = 4.58 MB bfp16 = 0.48% Memory, 1.02 ms = 1.4% latency, 9.2 TFLOPS)
)
(post_attention_layernorm): MixedFusedLayerNorm(3.02 K = 3.02 K bfp16 = 0% Params, 0 MACs = 0% MACs, 6.05 KB = 6.05 KB bfp16 = 0% Memory, 105.14 μs = 0.14% latency, 0 FLOPS)
(mlp): ParallelMLP(
18.3 M = 18.3 M bfp16 = 3.85% Params, 37.46 GMACs = 12.36% MACs, 36.59 MB = 36.59 MB bfp16 = 3.85% Memory, 7.31 ms = 10.04% latency, 10.24 TFLOPS
(dense_h_to_4h): ColumnParallelLinear(9.15 M = 9.15 M bfp16 = 1.93% Params, 18.73 GMACs = 6.18% MACs, 18.3 MB = 18.3 MB bfp16 = 1.93% Memory, 3.51 ms = 4.82% latency, 10.68 TFLOPS)
(dense_4h_to_h): RowParallelLinear(9.15 M = 9.15 M bfp16 = 1.93% Params, 18.73 GMACs = 6.18% MACs, 18.29 MB = 18.29 MB bfp16 = 1.93% Memory, 3.6 ms = 4.94% latency, 10.42 TFLOPS)
)
)
...
```
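To read a line of the output above: each module reports its parameter count (with the dtype breakdown) and share of total Params, its MACs and share of total MACs, the estimated Memory (again with the dtype breakdown) and its share, its latency and share of total latency, and the achieved FLOPS. As a rough back-of-the-envelope sanity check on the (query_key_value) line: 6.86 M bf16 parameters × 2 bytes ≈ 13.73 MB of memory, and 2 FLOPs per MAC × 14.05 GMACs / 2.89 ms ≈ 9.7 TFLOPS, both matching the reported values.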
Hi @cli99, can you please have a look and let me know what you think?
Sorry to bug you, @cli99, but is the list of reviewers for this file accurate, or should I reach out to someone else? Thanks!
@clumsy, sorry for the late response; I was out of the office for a while. Can you please resolve the conflicts with the main branch? Thanks.
Sure @cli99, are you in favor of the proposed changes though? Now I also need to find a way to make it reflect the expert parameters.
Most changes look good to me. The "dtype breakdown for both Memory and Params sections" and the corresponding fields in the expr are a bit unclear to me. Can you provide a description of an example module expr and explain when this would be useful?
Thanks for checking the PR, @cli99. The dtype breakdown basically shows the type of the underlying parameters, which is especially useful when it is fixed and not controlled by the DeepSpeed config. But I agree it's perhaps too much, especially since we would need to keep track of both dense and expert parameters. I'll remove that part and keep the overall/per-layer memory, then.
After some deliberation I've decided to separate the formatting fixes from the memory details (which I believe are not that useful after all, since you can roughly multiply num_params by the fp32/fp16/bf16 element size). I'll cut a separate PR for this.
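For example, 475.1 M bf16 parameters × 2 bytes per parameter ≈ 950.2 MB, which is exactly the Memory figure already shown in the output above.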