Jason Chou
I don't know exactly what @punitkoura is working on, but the fact that 1. I can load OPT-2.7B within a reasonable time now as long as the world size matches (i.e....
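For reference, the world-size knobs live in `metaseq/service/constants.py`. A minimal sketch of the values that have to agree with how the checkpoint was sharded, assuming the stock service constants:

```python
# metaseq/service/constants.py (sketch; names assume the stock service file)
# Both values must match the number of model-parallel shards the checkpoint
# was saved with, e.g. 8 for the 8-way resharded checkpoints in this thread.
MODEL_PARALLEL = 8
TOTAL_WORLD_SIZE = 8
```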
Hi @punitkoura, could you share a bit about the root cause and the current status? I would be more than happy to help if there is something I can do!
@punitkoura I happen to be running PyTorch 1.12, so I will give it a try, but a couple of questions: 1. Was `convert_to_singleton` hanging because one of the processes died of OOM? Ideally...
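For context, the consolidation step being debugged is invoked roughly as below; the module path and the positional checkpoint-directory argument are assumptions based on the metaseq docs.

```bash
# Sketch: consolidate the sharded checkpoint into a single restored.pt.
# Point the argument at the directory holding the reshard-model_part-*.pt
# files. Every model-parallel worker materializes its shard, so one worker
# dying of OOM can leave the others blocked on a collective, i.e. a hang.
python -m metaseq.scripts.convert_to_singleton /home/jason_chou/redspot_home/66b/
```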
Using #430, `convert_to_singleton` completed successfully after writing `restored.pt` 🎉 However, `metaseq-api-local` failed to load from it, as it tries to put the whole model on `GPU 0`:
```
$ metaseq-api-local
...
```
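One way to sanity-check the consolidated file without involving GPU 0 is to load it on CPU first; a minimal sketch, assuming `restored.pt` is an ordinary `torch.save` checkpoint:

```python
# Sketch: inspect the consolidated checkpoint on CPU to sidestep the GPU 0 OOM.
import torch

state = torch.load("restored.pt", map_location="cpu")
# Print the top-level keys without allocating any GPU memory.
print(list(state.keys()))
```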
@punitkoura I just want to load the model and run inference (i.e. sentence completion). What should I put as `MODEL_FILE` in `constants.py` in order to load the model-parallel model?...
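For what it's worth, the stock `metaseq/service/constants.py` points `MODEL_FILE` at a `reshard.pt` inside `CHECKPOINT_FOLDER`, and as far as I can tell the loader then looks for per-rank `reshard-model_part-<i>.pt` files next to it. A sketch using the folder from this thread:

```python
# metaseq/service/constants.py (sketch; the folder is this thread's example)
import os

CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b/"
# The service resolves this name per model-parallel rank, so the shards
# should be named reshard-model_part-0.pt ... reshard-model_part-7.pt.
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")
```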
@punitkoura I think this is what you meant, but it didn't work. First, I manually renamed the files
```
for i in {0..7}
do
  mv reshard-model_part-$i-shard0.pt reshard-model_part-$i.pt
done
```
such...
@punitkoura Running off `origin/punitkoura/debug-407` (fbcf3e35b552126f0bfa8ef40f93b11614aaa2f8) with no change other than `CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b/"`:
```
$ metaseq-api-local
2022-10-27 02:18:18 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames...
```
@punitkoura At 8500e88 I got:
```
$ metaseq-api-local
cfg.distributed_training = {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': 13000, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus':...
```
@punitkoura 517d7ad indeed works 🎉:
```
$ git checkout remotes/origin/punitkoura/debug-407
M       metaseq/service/constants.py
Previous HEAD position was 8500e88 Add logging
HEAD is now at 517d7ad Add localhost
$
$ metaseq-api-local
cfg.distributed_training...
```
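With the server up, a quick completion request confirms end-to-end inference. A sketch assuming the stock service port 6010 and an OpenAI-style `/completions` route; both may differ on this branch:

```python
# Sketch: sentence completion against the locally hosted metaseq API.
# The port and route are assumptions based on the default service constants.
import requests

resp = requests.post(
    "http://localhost:6010/completions",
    json={"prompt": "The capital of France is", "max_tokens": 16},
)
print(resp.json())
```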
@punitkoura Could you take a look?