Masaki Kozuki issues

Results: 86 issues of Masaki Kozuki

[fsdp] Add option of AllGather rate limiting (mainly for ZERO2)

Adds a new `apply_rate_limting` argument to `thunder.distributed.fsdp` so that rate limiting of AllGather can be tried, especially when ZERO2 is used. The major changes this PR brings are...

distributed

Move Grad AllReduce Bucketing to inside of `thunder.executors.passes.transform_for_execution` from `torch_autograd.split_forward_backward`

## What does this PR do? Fixes #184. As per the title. The change adds logic to tell whether or not the input `TraceCtx` represents a DDP backward pass. cc...

distributed

[benchmark] add option to use torchao's float8 of dynamic scaling with fsdp2

This adds an option to use float8 from [torchao](https://github.com/pytorch/ao). An example command to use float8 with FSDP2: `torchrun --nproc-per-node 8 --local-ranks-filter 0 --role rank --tee 3 thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf`...

[benchmark] migrate to fsdp/ddp after jit, from fsdp/ddp before jit

## Llama-2-7b-hf & fsdp

Container of 20240804, 8 H100 80GB HBM3.

Command: `torchrun --nproc_per_node=8 thunder/benchmarks/benchmark_litgpt.py --compile=thunder_inductor_cat --model_name=Llama-2-7b-hf --distributed_mode=fsdp --shard_mode=zero2 --bucketing_mode=none`

| | main 5cc3011 | pr 3869547 |
|--------------|--------------|-----------------|
|...

`state_dict` transform for `fsdp(jit(model))`

## What does this PR do? Transform for `fsdp(jit(model)).state_dict()` by all-gathering and unpadding params
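The "all-gather and unpad" idea can be illustrated with a minimal, framework-free sketch. It is not Thunder's implementation: the helper names `shard_with_padding` and `all_gather_and_unpad` are hypothetical, the all-gather is simulated by concatenating per-rank lists, and zero padding appended at the end of the flattened parameter is an assumption.

```python
from typing import List


def shard_with_padding(flat_param: List[float], world_size: int) -> List[List[float]]:
    """Pad a flattened parameter with zeros so it splits evenly, then shard it.

    Hypothetical helper: assumes padding is appended at the end.
    """
    padding = (-len(flat_param)) % world_size
    padded = flat_param + [0.0] * padding
    shard_size = len(padded) // world_size
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def all_gather_and_unpad(shards: List[List[float]], original_numel: int) -> List[float]:
    """Simulate an all-gather by concatenating shards in rank order, then drop the padding."""
    gathered = [x for shard in shards for x in shard]
    return gathered[:original_numel]


# A 5-element parameter sharded across 4 simulated ranks needs 3 padding elements.
param = [1.0, 2.0, 3.0, 4.0, 5.0]
shards = shard_with_padding(param, world_size=4)
restored = all_gather_and_unpad(shards, original_numel=len(param))
```

In this sketch the original element count has to be stored alongside the shards, since the padded shards alone do not say how many trailing elements are padding.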

Record `Tensor.clone()` in trace

It should work if there's a clone between a reshape and an in-place operation. Currently, it fails with the same error because `clone` is not recorded in the trace: https://github.com/Lightning-AI/lightning-thunder/blob/ffbebe07bdf003c3a60f2cab88298e96b80bdbba/thunder/torch/__init__.py#L2521-L2525 In the...

operators

Masaki Kozuki

[functionalization] error out transpose

multiple in-place ops to func's arg tensors that are a view of another

add "_gemm_input_role" to dunder slots

Disallow traceable tensor subclasses