Yan Wang
It can be reproduced in container `pjnl-20240830-mixology_70d843cd` but not `pjnl-20240830`
A minimal reproducer is:
```
import torch
import thunder

def fun(x):
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
# print(fun(x))
jfun = thunder.jit(fun)
jfun(x)
```
Torch can run `cuda...
Hi @t-vi @IvanYashchuk , we can discuss further whether https://github.com/Lightning-AI/lightning-thunder/pull/976 is necessary; this is a bug I found along the way, so I split it out so we can review...
I didn't exclude the operators that return views in the auto registration, since Ivan mentioned the stride information is not used. And now I find I didn't add the tensor view...
Hi @t-vi @IvanYashchuk , I rephrased it a bit; the main purpose of this notebook is to give an example of writing a simple functional Python function for a PyTorch module...
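As a rough illustration of that idea (a hypothetical sketch, not the notebook's actual code), a functional forward for a module can be written by passing the parameters explicitly, e.g. via `torch.func.functional_call`:
```
import torch
import torch.nn as nn

# A small module whose forward we want as a pure function of (params, inputs).
module = nn.Linear(4, 2)

def functional_forward(params, x):
    # Run the module with the given parameter dict instead of the
    # parameters stored on the module itself.
    return torch.func.functional_call(module, params, (x,))

params = dict(module.named_parameters())
x = torch.randn(3, 4)
out = functional_forward(params, x)
print(out.shape)  # torch.Size([3, 2])
```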
If we run this notebook in CI using the Hugging Face weights, HF_TOKEN is needed and the weights need to be downloaded to `Meta-Llama-3-8B/consolidated.00.pth` under the same folder as...
> The other question I'd have is if our use of the code is OK here (did we ask the gist author, do we think that the notebook is affected by...
The only failing case (https://github.com/Lightning-AI/lightning-thunder/pull/837#issuecomment-2245732546) I can reproduce locally is `FAILED thunder/tests/test_grad.py::test_vjp_correctness_adaptive_avg_pool2d_torch_cuda_thunder.dtypes.float64 - NotImplementedError: VJP for torch.nn.functional.adaptive_avg_pool2d is not implemented`; the reason is that this op only uses torchex.grad_transform. When...
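For context (a hypothetical sketch outside of thunder, not part of the test suite), the gradient of this op can be checked numerically with plain PyTorch, which is roughly what the float64 VJP correctness test exercises:
```
import torch
import torch.nn.functional as F

# Numerically check the gradient of adaptive_avg_pool2d in float64,
# similar in spirit to the failing test_vjp_correctness test.
x = torch.randn(1, 3, 8, 8, dtype=torch.float64, requires_grad=True)
assert torch.autograd.gradcheck(lambda t: F.adaptive_avg_pool2d(t, (4, 4)), (x,))
```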
Here is a comparison of memory usage on 1 node (8*H100) with ZeRO-3 vs. a single H100 with different numbers of layers | | micr_Bs=2,glb_bs=2 | | | zero3 micr_Bs=2,glb_bs=16 | | |...
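Numbers like these can be collected per run along the following lines (a hypothetical sketch; the actual measurement setup is not shown in the comment):
```
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training step of the model under the chosen config ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak allocated memory: {peak_gib:.2f} GiB")
```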
By further reducing Llama-2-13b-hf to n_layers=1, the memory usage in this case is related to rematerialization; the only differing part is the memory allocated by `[t93, t103,...
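One quick way to estimate how much memory such a list of intermediate tensors holds (a rough sketch with a placeholder tensor list, not the actual trace tensors) is to sum their storage sizes:
```
import torch

# Placeholder for the intermediates named in the trace (e.g. t93, t103, ...).
tensors = [torch.empty(2048, 5120, dtype=torch.bfloat16) for _ in range(4)]

bytes_held = sum(t.numel() * t.element_size() for t in tensors)
print(f"memory held by intermediates: {bytes_held / 2**20:.1f} MiB")
```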