Thien Tran

Results: 55 issues by Thien Tran

#### Context

What is the purpose of this PR? Is it to

- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or...

CLA Signed

Right now when `compile=True`, only the model is compiled: https://github.com/pytorch/torchtune/blob/e10142016798cf84f2e5c638a985014384f400a7/recipes/lora_finetune_single_device.py#L383-L386 We can further boost performance by including the loss calculation in the compile step. From my benchmarks, the improvement is pretty significant....
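A minimal sketch of the idea (not torchtune's actual recipe code; model and function names here are illustrative): instead of compiling only the model, compile a step function that also computes the loss, so the backend can fuse the final projection with the cross-entropy.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)

def loss_step(batch, labels):
    # model forward and loss live in one function, so both are traced
    logits = model(batch)
    return F.cross_entropy(logits, labels)

# compilation is lazy: it happens on the first call with real inputs
compiled_step = torch.compile(loss_step)

batch = torch.randn(8, 16)
labels = torch.randint(0, 4, (8,))
loss = loss_step(batch, labels)  # eager call shown for illustration; the
                                 # compiled version takes the same arguments
```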

enhancement

The recent addition of optimizer CPU offload in torchao can be useful for single-GPU low-memory configs. https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload In my brief testing https://github.com/pytorch/torchtune/compare/main...gau-nernst:torchtune:optim_offload, there is a **~25% increase in tok/s**....
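To illustrate the concept (this is a toy sketch, not torchao's `CPUOffloadOptimizer` API): optimizer state lives on CPU regardless of where the parameters are, gradients are copied over for the state update, and only the final update is copied back, trading PCIe traffic for GPU memory.

```python
import torch

class SimpleCPUOffloadSGD:
    """Toy SGD-with-momentum whose momentum buffers live on CPU,
    mirroring the idea behind optimizer CPU offload (illustrative
    names; not torchao's implementation)."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.lr, self.momentum = lr, momentum
        # optimizer state stays on CPU regardless of param device
        self.bufs = [torch.zeros_like(p, device="cpu") for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, buf in zip(self.params, self.bufs):
            if p.grad is None:
                continue
            g = p.grad.to("cpu", non_blocking=True)   # grad -> CPU
            buf.mul_(self.momentum).add_(g)           # state update on CPU
            p.sub_((self.lr * buf).to(p.device, non_blocking=True))
```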

When I searched for "profiler", only these 2 recipes support profiling:

- lora_finetune_single_device.py
- lora_finetune_distributed.py

Is there a reason the other recipes don't have a profiling option? In particular, I'm trying to...
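For reference, wiring profiling into a recipe's training loop can be as small as wrapping a few steps with `torch.profiler` (a generic sketch, not the recipes' actual profiler config):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# Profile a few forward/backward steps on CPU; add
# ProfilerActivity.CUDA when running on GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        model(x).sum().backward()

# Aggregated per-op timings, sorted by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```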

### 🐛 Describe the bug Between the 20240823 and 20240824 nightly versions, there is a serious memory leak. The latest nightly, 20240827, still has the issue. Without compile, there is no memory...
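One way to surface a per-step leak like this is to record allocated memory after each training step and check whether the readings keep climbing (a hypothetical helper, not the repro from the issue; it falls back to zeros when CUDA is unavailable so the sketch runs anywhere):

```python
import torch

def report_step_memory(step_fn, n_steps=5):
    """Run a training step repeatedly and record allocated CUDA memory
    after each step; monotonically growing readings suggest a leak."""
    readings = []
    for _ in range(n_steps):
        step_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            readings.append(torch.cuda.memory_allocated())
        else:
            readings.append(0)  # CPU fallback for environments without CUDA
    return readings
```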

high priority
triage review
oncall: pt2

Recently I worked on INT8 mixed-precision training in torchao. The relevant PR is here https://github.com/pytorch/ao/pull/748 Preliminary results show that with torchtitan, it improves speed by 20% on 8x A100 with...
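The usual building block behind INT8 mixed-precision training is symmetric per-row quantization of activations/weights before the matmul. A sketch of that step (illustrative only, not the torchao kernels from the PR):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-row INT8 quantization: scale each row so its
    absmax maps to 127, then round and clamp to the int8 range."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 127.0
    x_i8 = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_i8, scale.squeeze(-1)
```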

enhancement

New errors with PyTorch nightly (both CPU and CUDA) https://github.com/pytorch/ao/actions/runs/11087952684/job/30807207991 cc @jerryzh168

With the new addition of INT8 mixed-precision training, there are now 2 implementations of scaled INT8 matmul (INT8 matmul + dequant):

- https://github.com/pytorch/ao/blob/main/torchao/kernel/intmm_triton.py
- https://github.com/pytorch/ao/blob/main/torchao/prototype/quantized_training/int8_mm.py

I have identified the key...
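In reference form, "scaled INT8 matmul" is: multiply the INT8 operands with INT32 accumulation, then dequantize with the per-row and per-column scales. A plain-PyTorch reference (the real implementations above are Triton kernels; scale layout here is assumed per-row for A and per-column for B):

```python
import torch

def scaled_int8_mm(a_i8, b_i8, a_scale, b_scale):
    """Reference scaled INT8 matmul: widen to INT32 for accumulation
    (the kernels do this in-register), then dequantize with the scales."""
    acc = a_i8.to(torch.int32) @ b_i8.to(torch.int32)
    return acc.to(torch.float32) * a_scale[:, None] * b_scale[None, :]
```

Because dequantization is linear, this matches a float matmul of the dequantized operands, which makes it a convenient correctness oracle for the fused kernels.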

Just saw that PyTorch core made a branch cut for the 2.5 release, so nightly is now 2.6. It's probably a good idea to add CI to run on the PyTorch 2.5 RC? (don't even...
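One way such a CI addition could look (a hypothetical GitHub Actions matrix fragment; job names and index URLs are assumptions, not the repo's actual workflow): add the release-candidate channel alongside stable and nightly.

```yaml
# Hypothetical matrix fragment: test stable, the 2.5 RC ("test" channel),
# and nightly in one job.
jobs:
  test:
    strategy:
      matrix:
        torch-spec:
          - "torch==2.4.1"
          - "--pre torch==2.5.* --index-url https://download.pytorch.org/whl/test/cpu"
          - "--pre torch --index-url https://download.pytorch.org/whl/nightly/cpu"
```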

While trying out INT8 mixed-precision pretraining (#748) with torchtitan, I came across an issue: if the model is FSDP-sharded, `quantize_()` won't work. The fix would be adding an...
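One practical consequence is ordering: module-swapping transforms generally need to see plain tensors, not FSDP2-style `DTensor` shards. An illustrative guard (not torchao's actual fix) that detects whether a model has already been sharded:

```python
import torch
import torch.nn as nn

def has_sharded_params(model: nn.Module) -> bool:
    """Illustrative check: True if any parameter is a DTensor, i.e. the
    model is already FSDP2-sharded and tensor-subclass swaps like
    quantize_() should have been applied before sharding."""
    try:
        from torch.distributed.tensor import DTensor  # torch >= 2.4
    except ImportError:
        try:
            from torch.distributed._tensor import DTensor  # older torch
        except ImportError:
            return False  # no distributed support; nothing can be sharded
    return any(isinstance(p, DTensor) for p in model.parameters())
```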

triaged