Vadim Kantorov issues

Results 213 issues of Vadim Kantorov

Batteries included: promote some basic version/utils of reasonably fast offline/batched inference into PyTorch core (maybe based on gpt-fast, nano-vllm, torchao)

Given how much LLM training (via FSDP) and inference (often with vllm) are both needed for RL/GRPO, I wonder if it's time to upstream some basic components / utils for...

OOM handling and recovery

We just hit OOM, revealing that by default torchtune does not use torch.compile and that it does not use fused linear cross entropy yet... I found the following report from...

discussion

OOM recovery under multi-node FSDP/HSDP

Bug description: Does torchtitan provide any recipes for implementing batch skipping / OOM recovery in a multi-node FSDP setup? In RL/GRPO training this is very pertinent (where we...

question

post training