Yan Wang
Unfortunately this change doesn't help the performance on Llama-2-7b-hf. Run the benchmark with `torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --micro_batch_size=2 --global_batch_size=16 --model_name=Llama-2-7b-hf --return_metrics_as_json=True --json_path=benchmark_litgpt_datanew.json`. Env: `H100 80GB * 8, nvfuser: 0.2.3+git729f36c`...
> Is the wait operation now inserted at what seems like the right place to allow computation-communication overlap? What does Nsight Systems profiling tell about the overlap?

Before this commit:...
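For reference, the schedule being discussed looks roughly like this in plain PyTorch. This is a hand-written sketch of the intended pattern, not thunder's generated trace; `sharded_linear` and its shapes are made up for illustration:

```python
import torch
import torch.distributed as dist

def sharded_linear(x, weight_shard, world_size):
    # Allocate the unsharded weight and launch the all-gather asynchronously.
    full_weight = torch.empty(
        weight_shard.shape[0] * world_size, *weight_shard.shape[1:],
        device=weight_shard.device, dtype=weight_shard.dtype,
    )
    work = dist.all_gather_into_tensor(full_weight, weight_shard, async_op=True)

    # Independent computation runs here, overlapping with the communication.
    x = torch.nn.functional.gelu(x)

    # The wait is sunk to just before the gathered weight is consumed.
    work.wait()
    return x @ full_weight.t()
```

If the wait sits immediately after the all-gather instead, the GeLU cannot overlap with the communication, which is what Nsight Systems profiling would show as serialized kernels and NCCL calls.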
Hi @t-vi @carmocca, I think it's ready to merge.
For #277, after bisecting I found https://github.com/Lightning-AI/lightning-thunder/commit/a76beb6328149b7799765a58ed38a892be39ca97 is the first bad commit. After comparing the traces before/after this commit, I found that the order of allgathers changed. Before the commit, the...
Hi @IvanYashchuk @crcrpar, using sort_wait_zero3 (sorting each allgather+wait to just before its consumer) together with an unlimited number of in-flight allgathers (pushing allgathers to the beginning of the trace) fixes the problem. It...
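A toy illustration of those two rules (this is not thunder's actual `sort_wait_zero3` implementation; the op records and field names are made up):

```python
def sort_comm(ops):
    """Reorder a toy trace; each op is a dict with 'kind', 'ins', 'outs'.

    Rule 1: hoist every all_gather to the front (unlimited in-flight).
    Rule 2: sink each wait to just before the first consumer of its output.
    """
    allgathers = [op for op in ops if op["kind"] == "all_gather"]
    pending_waits = {out: op for op in ops if op["kind"] == "wait"
                     for out in op["outs"]}
    others = [op for op in ops if op["kind"] not in ("all_gather", "wait")]

    scheduled = list(allgathers)
    for op in others:
        for name in op["ins"]:
            wait = pending_waits.pop(name, None)
            if wait is not None:
                scheduled.append(wait)  # wait lands right before its consumer
        scheduled.append(op)
    scheduled.extend(pending_waits.values())  # waits nobody consumed
    return scheduled

trace = [
    {"kind": "all_gather", "ins": ["w1_shard"], "outs": ["h1"]},
    {"kind": "wait",       "ins": ["h1"],       "outs": ["w1"]},
    {"kind": "all_gather", "ins": ["w2_shard"], "outs": ["h2"]},
    {"kind": "wait",       "ins": ["h2"],       "outs": ["w2"]},
    {"kind": "matmul",     "ins": ["x", "w1"],  "outs": ["y1"]},
    {"kind": "matmul",     "ins": ["y1", "w2"], "outs": ["y2"]},
]
# sort_comm(trace) yields: ag(w1), ag(w2), wait(h1), matmul1, wait(h2),
# matmul2 -- the second all-gather now overlaps with the first matmul.
```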
> Before merging let's see what's the impact on perf this latest iteration has.

`torchrun --nproc_per_node=8 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --compile=thunder --distributed_mode=fsdp --micro_batch_size=2 --global_batch_size=16 --model_name=Llama-2-7b-hf`

On main (b8705922c344a7d08f9ac43ac1b06d2ff7bbaf62):

```
Model name: Llama-2-7b-hf Seq...
```
> I would prefer that we functionalize the RNG state handling within thunder and I wonder if this could be achieved with moderate effort (so the problem is similar to...
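Roughly, functionalizing the RNG state means each random op threads the state through explicitly instead of mutating a global. A minimal sketch in plain PyTorch of what that could look like (the names and signature are hypothetical, not thunder's actual design):

```python
import torch

def functional_dropout(x, p, rng_state):
    # The RNG state comes in as an argument and the advanced state is
    # returned, so the op is pure and the trace has no hidden mutation.
    gen = torch.Generator(device=x.device)
    gen.set_state(rng_state)
    mask = torch.rand(x.shape, generator=gen, device=x.device) > p
    return x * mask / (1.0 - p), gen.get_state()

state0 = torch.Generator().get_state()
y, state1 = functional_dropout(torch.ones(4), 0.5, state0)  # state1 feeds the next random op
```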
Sure, let me know your availability
Hi @t-vi @mruberry, I think it's ready to merge.
Hi @mruberry @jjsjann123 @IvanYashchuk, I modified it according to our design review discussion, could you take a look?