Thien Tran

Results: 55 issues by Thien Tran

#### Context

What is the purpose of this PR? Is it to

- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or...

CLA Signed

Right now when `compile=True`, only the model is compiled: https://github.com/pytorch/torchtune/blob/e10142016798cf84f2e5c638a985014384f400a7/recipes/lora_finetune_single_device.py#L383-L386 We can further boost performance by including the loss calculation in the compile step. From my benchmarks, the improvement is pretty significant....
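A minimal sketch of the idea (not torchtune's actual recipe code; model and function names here are illustrative): instead of compiling only the model, compile a step function that also computes the loss, so the backend can fuse the final projection with the cross-entropy.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)

def loss_step(batch, labels):
    # model forward and loss live in one function, so both are traced
    logits = model(batch)
    return F.cross_entropy(logits, labels)

# compilation is lazy: it happens on the first call with real inputs
compiled_step = torch.compile(loss_step)

batch = torch.randn(8, 16)
labels = torch.randint(0, 4, (8,))
loss = loss_step(batch, labels)  # eager call shown for illustration; the
                                 # compiled version takes the same arguments
```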

enhancement

The recent addition of optimizer CPU offload in torchao can be useful for single-GPU low-memory configs. https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload In my brief testing https://github.com/pytorch/torchtune/compare/main...gau-nernst:torchtune:optim_offload, there is a **~25% increase in tok/s**....
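To illustrate the concept (this is a toy sketch, not torchao's `CPUOffloadOptimizer` API): optimizer state lives on CPU regardless of where the parameters are, gradients are copied over for the state update, and only the final update is copied back, trading PCIe traffic for GPU memory.

```python
import torch

class SimpleCPUOffloadSGD:
    """Toy SGD-with-momentum whose momentum buffers live on CPU,
    mirroring the idea behind optimizer CPU offload (illustrative
    names; not torchao's implementation)."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr, self.momentum = lr, momentum
        # optimizer state stays on CPU regardless of param device
        self.bufs = [torch.zeros_like(p, device="cpu") for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, buf in zip(self.params, self.bufs):
            if p.grad is None:
                continue
            g = p.grad.to("cpu", non_blocking=True)   # grad -> CPU
            buf.mul_(self.momentum).add_(g)           # state update on CPU
            p.sub_((self.lr * buf).to(p.device, non_blocking=True))
```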

When I searched for "profiler", only these 2 recipes support profiling:

- lora_finetune_single_device.py
- lora_finetune_distributed.py

Is there a reason the other recipes don't have a profiling option? In particular, I'm trying to...
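For reference, wiring profiling into a recipe's training loop can be as small as wrapping a few steps with `torch.profiler` (a generic sketch, not the recipes' actual profiler config):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# Profile a few forward/backward steps on CPU; add
# ProfilerActivity.CUDA when running on GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        model(x).sum().backward()

# Aggregated per-op timings, sorted by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```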

### 🐛 Describe the bug Between the 20240823 and 20240824 nightly versions, there is a serious memory leak. The latest nightly, 20240827, still has the issue. Without compile, there is no memory...
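One way to surface a per-step leak like this is to record allocated memory after each training step and check whether the readings keep climbing (a hypothetical helper, not the repro from the issue; it falls back to zeros when CUDA is unavailable so the sketch runs anywhere):

```python
import torch

def report_step_memory(step_fn, n_steps=5):
    """Run a training step repeatedly and record allocated CUDA memory
    after each step; monotonically growing readings suggest a leak."""
    readings = []
    for _ in range(n_steps):
        step_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            readings.append(torch.cuda.memory_allocated())
        else:
            readings.append(0)  # CPU fallback for environments without CUDA
    return readings
```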

high priority
triage review
oncall: pt2

Recently I worked on INT8 mixed-precision training in torchao. The relevant PR is here https://github.com/pytorch/ao/pull/748 Preliminary results show that with torchtitan, it improves speed by 20% on 8x A100 with...
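The usual building block behind INT8 mixed-precision training is symmetric per-row quantization of activations/weights before the matmul. A sketch of that step (illustrative only, not the torchao kernels from the PR):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-row INT8 quantization: scale each row so its
    absmax maps to 127, then round and clamp to the int8 range."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 127.0
    x_i8 = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_i8, scale.squeeze(-1)
```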

enhancement

New errors with PyTorch nightly (both CPU and CUDA) https://github.com/pytorch/ao/actions/runs/11087952684/job/30807207991 cc @jerryzh168

With the new addition of INT8 mixed-precision training, there are now 2 implementations of scaled INT8 matmul (INT8 matmul + dequant):

- https://github.com/pytorch/ao/blob/main/torchao/kernel/intmm_triton.py
- https://github.com/pytorch/ao/blob/main/torchao/prototype/quantized_training/int8_mm.py

I have identified the key...
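In reference form, "scaled INT8 matmul" is: multiply the INT8 operands with INT32 accumulation, then dequantize with the per-row and per-column scales. A plain-PyTorch reference (the real implementations above are Triton kernels; scale layout here is assumed per-row for A and per-column for B):

```python
import torch

def scaled_int8_mm(a_i8, b_i8, a_scale, b_scale):
    """Reference scaled INT8 matmul: widen to INT32 for accumulation
    (the kernels do this in-register), then dequantize with the scales."""
    acc = a_i8.to(torch.int32) @ b_i8.to(torch.int32)
    return acc.to(torch.float32) * a_scale[:, None] * b_scale[None, :]
```

Because dequantization is linear, this matches a float matmul of the dequantized operands, which makes it a convenient correctness oracle for the fused kernels.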

Just saw that PyTorch core made a branch cut for the 2.5 release, so nightly is now 2.6. It's probably a good idea to add CI to run on the PyTorch 2.5 RC? (don't even...
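One way such a CI addition could look (a hypothetical GitHub Actions matrix fragment; job names and index URLs are assumptions, not the repo's actual workflow): add the release-candidate channel alongside stable and nightly.

```yaml
# Hypothetical matrix fragment: test stable, the 2.5 RC ("test" channel),
# and nightly in one job.
jobs:
  test:
    strategy:
      matrix:
        torch-spec:
          - "torch==2.4.1"
          - "--pre torch==2.5.* --index-url https://download.pytorch.org/whl/test/cpu"
          - "--pre torch --index-url https://download.pytorch.org/whl/nightly/cpu"
```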

While trying out INT8 mixed-precision pretraining (#748) with torchtitan, I came across an issue: if the model is FSDP-sharded, `quantize_()` won't work. The fix would be adding an...
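One practical consequence is ordering: module-swapping transforms generally need to see plain tensors, not FSDP2-style `DTensor` shards. An illustrative guard (not torchao's actual fix) that detects whether a model has already been sharded:

```python
import torch
import torch.nn as nn

def has_sharded_params(model: nn.Module) -> bool:
    """Illustrative check: True if any parameter is a DTensor, i.e. the
    model is already FSDP2-sharded and tensor-subclass swaps like
    quantize_() should have been applied before sharding."""
    try:
        from torch.distributed.tensor import DTensor  # torch >= 2.4
    except ImportError:
        try:
            from torch.distributed._tensor import DTensor  # older torch
        except ImportError:
            return False  # no distributed support; nothing can be sharded
    return any(isinstance(p, DTensor) for p in model.parameters())
```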

triaged