Yan Wang
Unfortunately this change doesn't help the performance on Llama-2-7b-hf. Run the benchmark with `torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --micro_batch_size=2 --global_batch_size=16 --model_name=Llama-2-7b-hf --return_metrics_as_json=True --json_path=benchmark_litgpt_datanew.json`. Env: `H100 80GB * 8, nvfuser: 0.2.3+git729f36c`...
> Is the wait operation now inserted at what seems like the right place to allow computation-communication overlap? What does Nsight Systems profiling tell about the overlap?

Before this commit:...
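For reference, the schedule being discussed looks roughly like this in plain PyTorch. This is a hand-written sketch of the intended pattern, not thunder's generated trace; `sharded_linear` and its shapes are made up for illustration:

```python
import torch
import torch.distributed as dist

def sharded_linear(x, weight_shard, world_size):
    # Allocate the unsharded weight and launch the all-gather asynchronously.
    full_weight = torch.empty(
        weight_shard.shape[0] * world_size, *weight_shard.shape[1:],
        device=weight_shard.device, dtype=weight_shard.dtype,
    )
    work = dist.all_gather_into_tensor(full_weight, weight_shard, async_op=True)

    # Independent computation runs here, overlapping with the communication.
    x = torch.nn.functional.gelu(x)

    # The wait is sunk to just before the gathered weight is consumed.
    work.wait()
    return x @ full_weight.t()
```

If the wait sits immediately after the all-gather instead, the GeLU cannot overlap with the communication, which is what Nsight Systems profiling would show as serialized kernels and NCCL calls.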
Hi @t-vi @carmocca, I think it's ready to merge.
For #277, after bisecting I found https://github.com/Lightning-AI/lightning-thunder/commit/a76beb6328149b7799765a58ed38a892be39ca97 is the first bad commit. After comparing the traces before/after this commit, I found that the order of allgathers changed. Before the commit, the...
Hi @IvanYashchuk @crcrpar, using sort_wait_zero3 (sorting each allgather+wait to just before its consumer) together with an unlimited number of in-flight allgathers (pushing allgathers to the beginning of the trace) fixes the problem. It...
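A toy illustration of those two rules (this is not thunder's actual `sort_wait_zero3` implementation; the op records and field names are made up):

```python
def sort_comm(ops):
    """Reorder a toy trace; each op is a dict with 'kind', 'ins', 'outs'.

    Rule 1: hoist every all_gather to the front (unlimited in-flight).
    Rule 2: sink each wait to just before the first consumer of its output.
    """
    allgathers = [op for op in ops if op["kind"] == "all_gather"]
    pending_waits = {out: op for op in ops if op["kind"] == "wait"
                     for out in op["outs"]}
    others = [op for op in ops if op["kind"] not in ("all_gather", "wait")]

    scheduled = list(allgathers)
    for op in others:
        for name in op["ins"]:
            wait = pending_waits.pop(name, None)
            if wait is not None:
                scheduled.append(wait)  # wait lands right before its consumer
        scheduled.append(op)
    scheduled.extend(pending_waits.values())  # waits nobody consumed
    return scheduled

trace = [
    {"kind": "all_gather", "ins": ["w1_shard"], "outs": ["h1"]},
    {"kind": "wait",       "ins": ["h1"],       "outs": ["w1"]},
    {"kind": "all_gather", "ins": ["w2_shard"], "outs": ["h2"]},
    {"kind": "wait",       "ins": ["h2"],       "outs": ["w2"]},
    {"kind": "matmul",     "ins": ["x", "w1"],  "outs": ["y1"]},
    {"kind": "matmul",     "ins": ["y1", "w2"], "outs": ["y2"]},
]
# sort_comm(trace) yields: ag(w1), ag(w2), wait(h1), matmul1, wait(h2),
# matmul2 -- the second all-gather now overlaps with the first matmul.
```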
> Before merging let's see what's the impact on perf this latest iteration has.

`torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --micro_batch_size=2 --global_batch_size=16 --model_name=Llama-2-7b-hf`

On main (b8705922c344a7d08f9ac43ac1b06d2ff7bbaf62):

```
Model name: Llama-2-7b-hf Seq...
```
> I would prefer that we functionalize the RNG state handling within thunder and I wonder if this could be achieved with moderate effort (so the problem is similar to...
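Roughly, functionalizing the RNG state means each random op threads the state through explicitly instead of mutating a global. A minimal sketch in plain PyTorch of what that could look like (the names and signature are hypothetical, not thunder's actual design):

```python
import torch

def functional_dropout(x, p, rng_state):
    # The RNG state comes in as an argument and the advanced state is
    # returned, so the op is pure and the trace has no hidden mutation.
    gen = torch.Generator(device=x.device)
    gen.set_state(rng_state)
    mask = torch.rand(x.shape, generator=gen, device=x.device) > p
    return x * mask / (1.0 - p), gen.get_state()

state0 = torch.Generator().get_state()
y, state1 = functional_dropout(torch.ones(4), 0.5, state0)  # state1 feeds the next random op
```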
Sure, let me know your availability
Hi @t-vi @mruberry, I think it's ready to merge.
Hi @mruberry @jjsjann123 @IvanYashchuk, I modified it according to our design review discussion, could you take a look?