Yan Wang

78 comments of Yan Wang

Sorry for the trouble, I should have turned the draft status on.

`jit(vjp(fn), disable_torch_autograd=True)` doesn't always work. For example, with the patch below, running `pytest test_grad.py -vs -k test_vjp_correctness_abs_nvfuser_cuda_float64` hits a `NotImplementedError`: we get an `OPAQUE` whose first input is not `PseudoInst.CONSTANT`,...
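For context, a minimal sketch of the failing pattern. Only `jit(vjp(fn), disable_torch_autograd=True)` itself is taken from the comment above; the `vjp` import path and the `(primals, cotangents)` call convention are assumptions:

```python
# Minimal sketch of the failing combination; the import path and the
# call convention for vjp are assumptions, not confirmed by the comment.
import torch
import thunder
from thunder.core.transforms import vjp

def fn(x):
    return x.abs()

x = torch.randn(4, dtype=torch.float64, device="cuda")
v = torch.ones_like(x)  # cotangent matching fn's output

# Jitting the vjp-transformed function with torch autograd disabled is
# the pattern that can hit NotImplementedError in the interpreter.
jfn = thunder.jit(vjp(fn), disable_torch_autograd=True)
outs, grads = jfn((x,), (v,))
```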

There are some cases I don't know how to fix; I'm going to file a specific reproduction in a separate issue and get some help from the JIT experts.

There's a similar issue here: https://github.com/Lightning-AI/lightning-thunder/issues/283 cc: @IvanYashchuk @t-vi

After bisecting, I found that a76beb6328149b7799765a58ed38a892be39ca97 is the first bad commit. It can be reproduced by running `python ../thunder/benchmarks/distributed.py --world-size 2 --model Llama-2-7b-hf -D fsdp --bucketing-strategies none --sharding-strategies zero2 --skip-torch`...

Hi @t-vi, when an operator is auto-registered, in the forward trace it is denoted the same as the thunder.torch symbol or the torch operator (in the execution trace), but in the backward trace the auto-registered operator...

If we just look at the text form of the traces it's hard to tell them apart, but once we get to the symbol.meta level we can always see the difference in...
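To make the comparison concrete, here is a hedged sketch of inspecting the two traces. `thunder.last_traces` and `thunder.last_backward_traces` are thunder's trace-inspection entry points; the function and operator below are stand-ins, not the auto-registered operator from the issue:

```python
# Sketch: compare how an operator is named in the forward vs. backward
# trace. The names printed from bound_symbols are the textual form;
# the symbol's meta is where the real difference shows up.
import torch
import thunder

def fn(x):
    return torch.nn.functional.relu(x)

jfn = thunder.jit(fn)
x = torch.randn(4, requires_grad=True)
jfn(x).sum().backward()

fwd_trace = thunder.last_traces(jfn)[-1]
bwd_trace = thunder.last_backward_traces(jfn)[-1]

for bsym in fwd_trace.bound_symbols:
    print("fwd:", bsym.sym.name, bsym.sym.meta)
for bsym in bwd_trace.bound_symbols:
    print("bwd:", bsym.sym.name, bsym.sym.meta)
```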

> We can probably do that without too much of an issue. @kiya00, what would you think of lazily populating this statistic by inspecting the first trace?

Hi @t-vi @mruberry! Sure,...
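If it helps, a hypothetical sketch of what "lazily populating this statistic by inspecting the first trace" could look like (the class and attribute names are illustrative, not thunder's actual API):

```python
# Hypothetical sketch: compute the statistic from the first trace only
# when it is first requested, then cache the result.
from functools import cached_property

class TraceStats:
    def __init__(self, traces):
        self._traces = traces

    @cached_property
    def symbol_count(self):
        # Inspect the first trace lazily on first access.
        return len(self._traces[0].bound_symbols)
```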

It seems I couldn't reproduce this error on an H100 80GB; instead I got an OOM (I removed `--save_logs_for_all_batches True`).

```
container: pjnl-20240801
lightning-thunder 0.2.0.dev0 /opt/pytorch/lightning-thunder
nvfuser 0.2.8+git671171f /opt/pytorch/nvfuser
```

```...

Yes, I used 1 node with 8 H100s (80GB). Has anyone else tried it to see if it's reproducible?