Masaki Kozuki

Results 86 issues of Masaki Kozuki

Adds a new argument, `apply_rate_limting`, to `thunder.distributed.fsdp` so that we can try rate-limiting AllGather, especially when ZeRO2 is used. The major changes this PR brings are...

distributed

## What does this PR do? Fixes #184 As per title. The change adds logic to tell whether or not the input `TraceCtx` represents DDP backward. cc...

distributed

This adds an option to use float8 of [torchao](https://github.com/pytorch/ao). An example command to use float8 with FSDP2: ``` torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf...

## Llama-2-7b-hf & fsdp

container of 20240804. 8 H100 80GB HBM3.

command: `torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --model_name=Llama-2-7b-hf --distributed_mode=fsdp --shard_mode=zero2 --bucketing_mode=none`

| | main 5cc3011 | pr 3869547 |
|--------------|--------------|-----------------|
|...

## What does this PR do? Transform for `fsdp(jit(model)).state_dict()` by all-gathering and unpadding params
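The "all-gather and unpad" idea can be illustrated with a minimal single-process sketch. It simulates ranks with plain Python lists and is not Thunder's implementation; the helper names `shard_with_padding` and `all_gather_and_unpad` are hypothetical, and zero-padding to a multiple of the world size is an assumption about how the shards were produced.

```python
def shard_with_padding(flat_param, world_size):
    """Split a flat parameter into equal-sized shards, zero-padding the tail."""
    chunk = -(-len(flat_param) // world_size)  # ceiling division
    pad = chunk * world_size - len(flat_param)
    padded = list(flat_param) + [0.0] * pad
    return [padded[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather_and_unpad(shards, original_numel):
    """Concatenate every rank's shard (the all-gather), then drop the padding."""
    gathered = [value for shard in shards for value in shard]
    return gathered[:original_numel]


param = [float(i) for i in range(10)]           # 10 elements, not divisible by 4
shards = shard_with_padding(param, world_size=4)
restored = all_gather_and_unpad(shards, len(param))
```

Here each of the four simulated ranks holds three elements, the last shard carries two padding zeros, and `restored` matches the original unsharded parameter, which is the shape a `state_dict()` consumer would expect.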

It should work if there's a clone between reshape and in-place operation. Currently, it fails with the same error because `clone` is not recorded on the trace: https://github.com/Lightning-AI/lightning-thunder/blob/ffbebe07bdf003c3a60f2cab88298e96b80bdbba/thunder/torch/__init__.py#L2521-L2525 In the...

operators

## What does this PR do? As per title.

Looks good to me, I wonder if we want a note about aliases as args?

```python
def f(a, b):
    return a.exp_().sin_() + b.exp_().sin_()

a = torch.randn(5, 5)
f(a, a)
```
...
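To see why aliased arguments matter for in-place operations, here is a small sketch that avoids any tensor library. `Vec` is a hypothetical stand-in for a tensor with in-place `exp_` and `sin_` methods, not a PyTorch or Thunder type; it only illustrates the eager-mode semantics a note on aliasing would need to describe.

```python
import math


class Vec:
    """Hypothetical stand-in for a tensor that supports in-place ops."""

    def __init__(self, values):
        self.values = list(values)

    def exp_(self):
        self.values[:] = [math.exp(v) for v in self.values]
        return self  # in-place ops return the same object

    def sin_(self):
        self.values[:] = [math.sin(v) for v in self.values]
        return self

    def clone(self):
        return Vec(self.values)

    def __add__(self, other):
        return Vec(x + y for x, y in zip(self.values, other.values))


def f(a, b):
    return a.exp_().sin_() + b.exp_().sin_()


x = [0.0, 0.5, 1.0]

a = Vec(x)
aliased = f(a, a).values           # both arguments are the same object

b = Vec(x)
distinct = f(b, b.clone()).values  # second argument is an independent copy
```

With distinct inputs the result is `2 * sin(exp(x))`. With aliased inputs the second chain mutates the object the first chain already returned, so the result is `2 * sin(exp(sin(exp(x))))`. A tracer that treats `a` and `b` as independent values would reproduce the first result, not the second.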

aliasing

as I didn't see a strong reason to exclude it from the slots

CLA Signed

## What does this PR do? Conservatively errors out when any tensor input is a traceable tensor subclass and any non-PyTorch executor is enabled. Ideally we should interpret, translate,...