Kurt Shuster
mutators might be the way to go here; closing this issue as there's been no activity
This seems to no longer be an issue
- [ ] provide an easy way to step through model forward passes to actually examine outputs of the modules
Assuming we have the following:

```
CHECKPOINT=/path/to/fsdp_sharded_checkpoint/checkpoint_last
CONSOLIDATED=/path/to/new_consolidated_checkpoint/
RESHARDED=/path/to/new_resharded_checkpoint/
MP=16
```

### Step 0 (Optional, if necessary)

[Consolidate the model](https://github.com/facebookresearch/metaseq/blob/bbcedfebb4c35f71cdda1f1a358491f3996a9fc3/metaseq/scripts/consolidate_fsdp_shards.py) from the FSDP shards into one checkpoint:

```bash
python consolidate_fsdp_shards.py...
```
still fine to init as megatron model
Have you properly installed apex for fp16 support? That's the first thing that comes to mind as to why you might be experiencing OOMs; 16 x 32GB GPUs is plenty...
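A quick way to check whether apex is even visible to your Python environment is something like the sketch below. It only uses the standard library, and the helper name `has_module` is made up for illustration. Note that it only tells you whether the package can be found, not whether it was built with the CUDA/C++ extensions, so treat a "found" as necessary but not sufficient.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the import system can locate a top-level module `name`.

    Hypothetical helper for illustration; it does not import the module.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "apex" and "torch" are the top-level package names to look for.
    for name in ("torch", "apex"):
        status = "found" if has_module(name) else "MISSING"
        print(f"{name}: {status}")
```

Run it with the same interpreter you use to launch training; if `apex` shows up as missing there, that is the first thing to fix.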
I pushed some changes to update the UI and also the queue logic. Will add a screenshot once I can load the thing