Kurt Shuster

198 comments by Kurt Shuster

mutators might be the way to go here; closing this issue as there's been no activity

This seems to no longer be an issue

- [ ] provide an easy way to step through model forward passes to actually examine outputs of the modules

Assuming we have the following:

```
CHECKPOINT=/path/to/fsdp_sharded_checkpoint/checkpoint_last
CONSOLIDATED=/path/to/new_consolidated_checkpoint/
RESHARDED=/path/to/new_resharded_checkpoint/
MP=16
```

### Step 0 (Optional, if necessary)

[Consolidate the model](https://github.com/facebookresearch/metaseq/blob/bbcedfebb4c35f71cdda1f1a358491f3996a9fc3/metaseq/scripts/consolidate_fsdp_shards.py) from the FSDP shards into one checkpoint:

```bash
python consolidate_fsdp_shards.py...
```

still fine to init as megatron model

have you properly installed apex for fp16 support? that's the first thing that comes to mind as to why you might be experiencing OOMs; 16 x 32gb GPUs is plenty...
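A quick way to rule out the apex question is to check whether the package can be found in the environment you launch training from. This is a minimal sketch using only the standard library; the helper names `module_available` and `apex_available` are made up for illustration and are not part of any library. It only checks that a top-level package named `apex` is discoverable, not that it was built with its CUDA extensions, and an unrelated package that happens to be named `apex` would also pass.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located.

    Hypothetical helper: uses importlib's finder machinery, so the
    module itself is not imported or executed.
    """
    return importlib.util.find_spec(name) is not None


def apex_available() -> bool:
    """Return True if a top-level package named `apex` is discoverable."""
    return module_available("apex")


if __name__ == "__main__":
    if apex_available():
        print("apex found on the import path")
    else:
        print("apex not found; fp16 code paths that depend on it may be unavailable")
```

If this reports that apex is missing, reinstalling it is probably the first thing to try before looking for other causes of the OOMs.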

I pushed some changes to update the UI and also the queue logic. will add a screenshot once I can load the thing