calc_transformer_mem.py is inaccurate for most popular open models
Running calc_transformer_mem.py with the parameters of Qwen1.5-72B reports 56.19 billion parameters, while the real count is around 72 billion:
python calc_transformer_mem.py --infer --high-prec-bytes-per-val 4 --low-prec-bytes-per-val 1 --num-gpus 2 --zero-stage 3 -ca -b 1 -s 1024 -v 152064 -hs 8192 -a 64 -l 80 -kv 1 -ff 3
My guess is that this is because the script assumes two linear projections per MLP block, while most popular open-source models like Llama, Mixtral, Qwen, etc. use a gated MLP with three (gate, up, and down projections):
https://github.com/huggingface/transformers/blob/6e584070d4f86964c4268baed08a5a5da8f82633/src/transformers/models/llama/modeling_llama.py#L240
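A rough back-of-the-envelope count seems to reproduce both numbers. This is only a sketch that ignores biases, layer norms, and embedding tying, and assumes intermediate_size = 3 * hidden_size = 24576, matching the -ff 3 and -hs 8192 values in the command above:

```python
# Back-of-the-envelope parameter count for Qwen1.5-72B-like dimensions,
# ignoring biases, layer norms, and embedding tying.
hidden = 8192
layers = 80
vocab = 152064
ffn = 3 * hidden                 # --ffn-expansion-factor 3 -> intermediate_size 24576

attn = 4 * hidden * hidden       # q, k, v and output projections (MHA, -kv 1)
mlp_2 = 2 * hidden * ffn         # up + down only (what the script appears to assume)
mlp_3 = 3 * hidden * ffn         # gate + up + down (Llama/Qwen-style gated MLP)
embed = 2 * vocab * hidden       # untied input embedding + LM head

print((embed + layers * (attn + mlp_2)) / 1e9)  # ~56.2B, matches the script's output
print((embed + layers * (attn + mlp_3)) / 1e9)  # ~72.3B, close to the real model size
```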
(Also, the --ffn-expansion-factor flag only accepts an integer, while for Llama-2-70B the expansion factor is 3.5 (intermediate size 28672 over hidden size 8192). Likewise, --low-prec-bytes-per-val needs to go below 1 for quantized models, e.g. 0.5 bytes per value for 4-bit weights.)
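A minimal sketch of the kind of change that would cover both cases, assuming the flags are defined with argparse roughly like this (the actual definitions in the script may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical versions of the existing flags: type=float instead of type=int
# would allow Llama-2-70B's 3.5x FFN expansion and sub-byte values such as
# 0.5 bytes/value for 4-bit quantized weights.
parser.add_argument("--ffn-expansion-factor", "-ff", type=float)
parser.add_argument("--low-prec-bytes-per-val", type=float)
```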