TypeError for Mixtral-8x7B-v0.1: unsupported format string passed to NoneType.__format__
🐛 Bug
When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode, we get the following error:
```
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 887, in benchmark_main
0: [rank0]:     print(f"Tokens/s: {benchmark.perf_metrics['tokens_per_sec']:.02f}")
0: [rank0]: TypeError: unsupported format string passed to NoneType.__format__
```
I see in the log that there was a message:
Model Flops/Throughput calculation failed for model Mixtral-8x7B-v0.1. Skipping throughput metric collection.
It might be caused by this code in benchmark_litgpt.py:
```python
try:
    # Calculate the model FLOPs
    self.calculate_model_flops()
    # Setup throughput Collection
    self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)
except:
    self.throughput = None
    print(
        f"Model Flops/Throughput calculation failed for model {self.model_name}. Skipping throughput metric collection."
    )
```
Both self.calculate_model_flops() and the Throughput setup are inside the same try/except block. I would keep only calculate_model_flops() there, but maybe there were problems with constructing Throughput as well and I'm just not aware of them.
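A minimal sketch of that idea, with separate try/except blocks so a failure in one metric does not silently disable the other. This is not the actual benchmark_litgpt.py code; `calculate_model_flops` and `make_throughput` are stand-ins for the real callables:

```python
# Hypothetical sketch: isolate the FLOPs calculation from the throughput
# setup, so that a FLOPs failure (as with Mixtral-8x7B-v0.1) does not also
# set self.throughput to None.

def setup_metrics(calculate_model_flops, make_throughput, model_name):
    model_flops = None
    throughput = None

    try:
        # FLOPs calculation on its own, so its failure is isolated.
        model_flops = calculate_model_flops()
    except Exception:
        print(f"Model FLOPs calculation failed for model {model_name}. Skipping FLOPs metric collection.")

    try:
        # Throughput collection is set up independently of the FLOPs result.
        throughput = make_throughput()
    except Exception:
        print(f"Throughput setup failed for model {model_name}. Skipping throughput metric collection.")

    return model_flops, throughput


def failing_flops():
    # Simulate the failure reported in the logs.
    raise RuntimeError("FLOPs estimation unsupported for this model")


# A failing FLOPs calculation no longer prevents throughput collection.
flops, throughput = setup_metrics(failing_flops, lambda: "throughput-collector", "Mixtral-8x7B-v0.1")
```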
Another possible fix is to check that tokens_per_sec is present in the dictionary (and not None) before formatting it.
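A sketch of that guard, assuming a `perf_metrics` dictionary like the one in benchmark_litgpt.py, where after a skipped throughput collection the key exists but maps to None (the helper name is hypothetical):

```python
# Hypothetical guarded formatting: only apply the :.02f format spec when the
# metric is actually set, since formatting None raises
# "TypeError: unsupported format string passed to NoneType.__format__".

def format_tokens_per_sec(perf_metrics):
    tokens_per_sec = perf_metrics.get("tokens_per_sec")
    if tokens_per_sec is None:
        # Covers both a missing key and an explicit None value.
        return "Tokens/s: n/a (throughput collection skipped)"
    return f"Tokens/s: {tokens_per_sec:.02f}"


print(format_tokens_per_sec({"tokens_per_sec": 1234.5}))  # Tokens/s: 1234.50
print(format_tokens_per_sec({"tokens_per_sec": None}))    # Tokens/s: n/a (throughput collection skipped)
```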
To Reproduce
Please use:
8 node(s), each with 8 GPUs. Image "INTERNAL_IMAGE:pjnl-20241001"
Training script:
```shell
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name Mixtral-8x7B-v0.1 \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile eager \
    --checkpoint_activations True \
    --low_precision_mode none \
    --micro_batch_size 1
```
Expected behavior
We should be able to run the benchmarking script, even if we are not able to print a few metrics.
Environment
```
system.device_product_name        DGXH100
system.gpu_driver_version         535.129.03
libraries.cuda                    12.6.2.004
libraries.pip.lightning           2.4.0.dev20240728
libraries.pip.lightning-thunder   0.2.0.dev0
libraries.pip.lightning-utilities 0.11.7
libraries.pip.litgpt              0.4.11
libraries.pip.nvfuser             0.2.13+git4cbd7a4
libraries.pip.pytorch-lightning   2.4.0
libraries.pip.torch               2.6.0a0+gitd6d9183
libraries.pip.torchmetrics        1.4.2
libraries.pip.torchvision         0.19.0a0+d23a6e1
```
Hey @eqy, this seems to be an eager mode bug, not related to thunder at all. Could you or your group take a look at this?
Actually, it's related to the benchmark_litgpt.py script. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't solve the missing results from the calculate_model_flops function.
I think we might get a similar problem in another place in the code:
```
0: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 974, in <module>
0: [rank0]:     CLI(benchmark_main)
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/jsonargparse/_cli.py", line 96, in CLI
0: [rank0]:     return _run_component(components, init)
0: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
0: [rank0]:     return component(**cfg)
0: [rank0]:            ^^^^^^^^^^^^^^^^
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 917, in benchmark_main
0: [rank0]:     print(f"TFLOP/s: {benchmark.perf_metrics['model_flop_per_sec'] / 1e12:.02f}")
0: [rank0]:                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
0: [rank0]: TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'
```
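The same kind of guard would apply here: dividing None by 1e12 raises the TypeError above before the format spec is even reached. A hypothetical helper (not present in benchmark_litgpt.py) could look like:

```python
# Hypothetical guard for the TFLOP/s line: skip the division and formatting
# entirely when model_flop_per_sec is None (e.g. after a failed FLOPs
# calculation).

def format_tflops(perf_metrics):
    flops = perf_metrics.get("model_flop_per_sec")
    if flops is None:
        return "TFLOP/s: n/a (model FLOPs calculation skipped)"
    return f"TFLOP/s: {flops / 1e12:.02f}"


print(format_tflops({"model_flop_per_sec": None}))    # TFLOP/s: n/a (model FLOPs calculation skipped)
print(format_tflops({"model_flop_per_sec": 3.5e13}))  # TFLOP/s: 35.00
```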
> I think we might get a similar problem in another place in the code:
Could we get a separate issue for this?
While there's no separate issue, here's a pull request fixing the problem reported in https://github.com/Lightning-AI/lightning-thunder/issues/1267#issuecomment-2470086145: https://github.com/Lightning-AI/lightning-thunder/pull/1469