
TypeError for Mixtral-8x7B-v0.1: unsupported format string passed to NoneType.__format__


🐛 Bug

When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode, we get the following error:

    0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 887, in benchmark_main
    0: [rank0]:     print(f"Tokens/s: {benchmark.perf_metrics['tokens_per_sec']:.02f}")
    0: [rank0]: TypeError: unsupported format string passed to NoneType.__format__

I see in the log that there was a message:

Model Flops/Throughput calculation failed for model Mixtral-8x7B-v0.1. Skipping throughput metric collection.

It might be caused by the fact that in this code in benchmark_litgpt.py:

    try:
        # Calculate the model FLOPs
        self.calculate_model_flops()
        # Setup throughput Collection
        self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)
    except:
        self.throughput = None
        print(
            f"Model Flops/Throughput calculation failed for model {self.model_name}. Skipping throughput metric collection."
        )

both self.calculate_model_flops() and the Throughput setup sit in the same try/except block. I'd put only calculate_model_flops() inside it, but maybe there were problems constructing Throughput that I'm just not aware of.
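For illustration, here is a minimal, self-contained sketch of what the narrowed try block could look like. The `Benchmark` stand-in class and the `flops_fails` flag are hypothetical additions to make the sketch runnable; only the guarded call mirrors the real script:

```python
class Benchmark:
    """Hypothetical stand-in for the benchmark class in benchmark_litgpt.py.

    Only calculate_model_flops() is guarded by try/except; the throughput
    setup runs unconditionally, so a FLOPs failure no longer disables it.
    """

    def __init__(self, flops_fails=False):
        self.model_name = "Mixtral-8x7B-v0.1"
        self._flops_fails = flops_fails  # test hook, not in the real script
        self.model_flops = None
        try:
            # Guard only the FLOPs calculation, which is the part known to fail.
            self.calculate_model_flops()
        except Exception:
            print(f"Model Flops calculation failed for model {self.model_name}.")
        # Throughput collection is set up regardless of FLOPs success.
        # In the real script this would be
        # Throughput(window_size=..., world_size=...).
        self.throughput = "throughput-collector"

    def calculate_model_flops(self):
        if self._flops_fails:
            raise RuntimeError("simulated FLOPs failure")
        self.model_flops = 1.0
```

With this split, a failure in the FLOPs calculation would still print the warning but leave throughput collection enabled.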

Another possible fix is to check that tokens_per_sec is present in the dictionary and not None before formatting it.
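Sketched as a small helper (the function name is hypothetical; the point is that both a missing key and a None value are handled before the format spec is applied):

```python
def format_tokens_per_sec(perf_metrics):
    """Return a printable Tokens/s line, tolerating a missing or None value.

    Hypothetical helper: dict.get() returns None for a missing key, and the
    explicit None check avoids handing a format spec to NoneType.__format__.
    """
    tokens_per_sec = perf_metrics.get("tokens_per_sec")
    if tokens_per_sec is None:
        return "Tokens/s: not available"
    return f"Tokens/s: {tokens_per_sec:.02f}"
```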

To Reproduce

Please use:

8 node(s), each with 8 GPUs. Image "INTERNAL_IMAGE:pjnl-20241001"

Training script:

    python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
        --model_name Mixtral-8x7B-v0.1 \
        --distributed_mode fsdp \
        --shard_mode zero3 \
        --compile eager \
        --checkpoint_activations True \
        --low_precision_mode none \
        --micro_batch_size 1

Expected behavior

The benchmarking script should run to completion, even if a few metrics cannot be printed.

Environment

    system.device_product_name         DGXH100
    system.gpu_driver_version          535.129.03
    libraries.cuda                     12.6.2.004
    libraries.pip.lightning            2.4.0.dev20240728
    libraries.pip.lightning-thunder    0.2.0.dev0
    libraries.pip.lightning-utilities  0.11.7
    libraries.pip.litgpt               0.4.11
    libraries.pip.nvfuser              0.2.13+git4cbd7a4
    libraries.pip.pytorch-lightning    2.4.0
    libraries.pip.torch                2.6.0a0+gitd6d9183
    libraries.pip.torchmetrics         1.4.2
    libraries.pip.torchvision          0.19.0a0+d23a6e1

mpatel31415 avatar Oct 07 '24 16:10 mpatel31415

Hey @eqy this seems to be an eager mode bug, not related to thunder at all. Could you / group take a look at this?

tfogal avatar Oct 11 '24 16:10 tfogal

Actually it's related to the benchmark_litgpt.py script. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't solve the missing results from the calculate_model_flops function.

mpatel31415 avatar Oct 14 '24 13:10 mpatel31415

I think we might get a similar problem in another place in the code:

    0: [rank0]: Traceback (most recent call last):
    0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 974, in <module>
    0: [rank0]:     CLI(benchmark_main)
    0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/jsonargparse/_cli.py", line 96, in CLI
    0: [rank0]:     return _run_component(components, init)
    0: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
    0: [rank0]:     return component(**cfg)
    0: [rank0]:            ^^^^^^^^^^^^^^^^
    0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 917, in benchmark_main
    0: [rank0]:     print(f"TFLOP/s: {benchmark.perf_metrics['model_flop_per_sec'] / 1e12:.02f}")
    0: [rank0]:                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
    0: [rank0]: TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'
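The same guard pattern would cover this site too (hypothetical helper, not the actual fix in the script; the division runs only once the value is known to be a number):

```python
def format_tflops(perf_metrics):
    """Return a printable TFLOP/s line, tolerating a missing or None value.

    Hypothetical helper: dividing None by 1e12 raises TypeError, so check
    for None before converting model FLOPs/s to TFLOP/s.
    """
    model_flop_per_sec = perf_metrics.get("model_flop_per_sec")
    if model_flop_per_sec is None:
        return "TFLOP/s: not available"
    return f"TFLOP/s: {model_flop_per_sec / 1e12:.02f}"
```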

mpatel31415 avatar Nov 12 '24 09:11 mpatel31415

> I think we might get similar problem in another place in the code:

Could we get a separate issue for this?

tfogal avatar Nov 22 '24 16:11 tfogal

While there's no separate issue, here's a pull request fixing the problem reported in https://github.com/Lightning-AI/lightning-thunder/issues/1267#issuecomment-2470086145: https://github.com/Lightning-AI/lightning-thunder/pull/1469

IvanYashchuk avatar Nov 25 '24 13:11 IvanYashchuk