Richard Gong
I'm also running into this (albeit with 4 A100 80GB). Wondering if there is a way we can work around it - happy to make a contribution if the direction...
I found a workaround that involves allowing CPU offloading during the phase where the state dict is saved. I verified that end-to-end 70B training with checkpointing works on [this repo](https://github.com/modal-labs/llama-finetuning). I...
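For reference, a minimal sketch of that kind of workaround, assuming the training loop uses PyTorch FSDP (the helper name `save_full_state_dict` and the save path are hypothetical, not from the repo above):

```python
# Sketch: when gathering a full (unsharded) state dict for a large FSDP
# model, offload the gathered parameters to CPU so the full checkpoint
# does not have to fit in a single GPU's memory.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)


def save_full_state_dict(model: FSDP, path: str) -> None:
    # offload_to_cpu=True moves each gathered shard to host memory;
    # rank0_only=True materializes the full state dict only on rank 0.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```

This must run inside an initialized process group with an FSDP-wrapped model; the context manager only changes how `state_dict()` gathers shards, so the rest of the checkpointing code is unchanged.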
The issue is reproducible with `min_size=100, max_size=100`. Increasing the size of the pool is not a feasible workaround here. Crucially, awaiting a new connection from the pool **should not block...