Yan Wang

78 comments of Yan Wang

Sorry for the trouble, I should have turned the draft status on.

`jit(vjp(fn), disable_torch_autograd=True)` doesn't always work. For example, with the patch below, running `pytest test_grad.py -vs -k test_vjp_correctness_abs_nvfuser_cuda_float64` hits a `NotImplementedError`: we get an `OPAQUE` whose first input is not `PseudoInst.CONSTANT`,...
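For context, a minimal sketch of the failing pattern. Only `jit(vjp(fn), disable_torch_autograd=True)` itself is taken from the comment above; the `vjp` import path and the `(primals, cotangents)` call convention are assumptions:

```python
# Minimal sketch of the failing combination; the import path and the
# call convention for vjp are assumptions, not confirmed by the comment.
import torch
import thunder
from thunder.core.transforms import vjp

def fn(x):
    return x.abs()

x = torch.randn(4, dtype=torch.float64, device="cuda")
v = torch.ones_like(x)  # cotangent matching fn's output

# Jitting the vjp-transformed function with torch autograd disabled is
# the pattern that can hit NotImplementedError in the interpreter.
jfn = thunder.jit(vjp(fn), disable_torch_autograd=True)
outs, grads = jfn((x,), (v,))
```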

There are some cases I don't know how to fix; I'm going to file a specific reproduction in a separate issue and get some help from the JIT experts.

There's a similar issue here: https://github.com/Lightning-AI/lightning-thunder/issues/283 cc: @IvanYashchuk @t-vi

After bisecting, I found that a76beb6328149b7799765a58ed38a892be39ca97 is the first bad commit. It can be reproduced by running `python ../thunder/benchmarks/distributed.py --world-size 2 --model Llama-2-7b-hf -D fsdp --bucketing-strategies none --sharding-strategies zero2 --skip-torch`...

Hi @t-vi, when an operator is auto-registered, in the forward trace it is denoted the same as the thunder.torch symbol or the torch operator (in the execution trace), but in the backward trace the auto-registered operator...

If we just look at the text form of the traces it's hard to tell them apart, but once we get to the symbol.meta level we can always see the difference in...
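To make the comparison concrete, here is a hedged sketch of inspecting the two traces. `thunder.last_traces` and `thunder.last_backward_traces` are thunder's trace-inspection entry points; the function and operator below are stand-ins, not the auto-registered operator from the issue:

```python
# Sketch: compare how an operator is named in the forward vs. backward
# trace. The names printed from bound_symbols are the textual form;
# the symbol's meta is where the real difference shows up.
import torch
import thunder

def fn(x):
    return torch.nn.functional.relu(x)

jfn = thunder.jit(fn)
x = torch.randn(4, requires_grad=True)
jfn(x).sum().backward()

fwd_trace = thunder.last_traces(jfn)[-1]
bwd_trace = thunder.last_backward_traces(jfn)[-1]

for bsym in fwd_trace.bound_symbols:
    print("fwd:", bsym.sym.name, bsym.sym.meta)
for bsym in bwd_trace.bound_symbols:
    print("bwd:", bsym.sym.name, bsym.sym.meta)
```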

> We can probably do that without too much of an issue. @kiya00, what would you think of lazily populating this statistic by inspecting the first trace?

Hi @t-vi @mruberry! Sure,...
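If it helps, a hypothetical sketch of what "lazily populating this statistic by inspecting the first trace" could look like (the class and attribute names are illustrative, not thunder's actual API):

```python
# Hypothetical sketch: compute the statistic from the first trace only
# when it is first requested, then cache the result.
from functools import cached_property

class TraceStats:
    def __init__(self, traces):
        self._traces = traces

    @cached_property
    def symbol_count(self):
        # Inspect the first trace lazily on first access.
        return len(self._traces[0].bound_symbols)
```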

It seems I couldn't reproduce this error on an H100 80GB; instead I got an OOM (I removed `--save_logs_for_all_batches True`).

```
container: pjnl-20240801
lightning-thunder 0.2.0.dev0 /opt/pytorch/lightning-thunder
nvfuser 0.2.8+git671171f /opt/pytorch/nvfuser
```

```...

Yes, I used 1 node with 8 H100s (80GB). Has anyone else tried it to see if it's reproducible?