bhack
Yes, generally it could be hard-mining cases, high-res inputs, HW constraints, etc. So I think that in Vision we really have the same kind of fine-tuning needs. I really...
E.g. see how many comments we had on the original SAM just related to fine-tuning: https://github.com/facebookresearch/segment-anything/issues/5
Also, just to make another example: your WIP RLHF with PPO https://github.com/pytorch/torchtune/pull/1005 or other approaches like that could still be useful in Vision/Multimodal https://encord.com/blog/guide-to-rlhf/. So I think this is why...
There are some `torch.compile` issues with these models: https://github.com/pytorch/pytorch/issues/103716
I meant that I am trying to test on a single GPU. In many places `cfg.DIST_ENABLE` is checked to safely go through the non-distributed code path. But in many other places...
E.g. here, instead, the code is checking `cfg.DIST_ENABLE`: https://github.com/yoxu515/aot-benchmark/blob/1c3a5ec51d81f3e17ff9092aa1e830206d766132/networks/managers/trainer.py#L59-L83
Is this not in the trainer? I am trying to run a single-GPU training job with `cfg.DIST_ENABLE=False`.
Yes, that is what I meant. Are we not going to have issues if we don't conditionally wrap `torch.nn.parallel.DistributedDataParallel` in the trainer?
Isn't that one going to require `init_process_group`? But it is conditionally wrapped in the trainer: https://github.com/yoxu515/aot-benchmark/blob/1c3a5ec51d81f3e17ff9092aa1e830206d766132/networks/managers/trainer.py#L59-L64
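Just to be explicit about what I mean by the conditional wrap (a minimal sketch of the pattern, not the actual trainer code; `cfg`, `model`, and `gpu_id` are placeholders here):

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model(model, cfg, gpu_id):
    # Placeholder helper: cfg.DIST_ENABLE stands in for the flag in configs/default.py.
    model = model.cuda(gpu_id)
    if cfg.DIST_ENABLE:
        # DDP needs an initialized process group, so init_process_group
        # has to run before the wrap (MASTER_ADDR etc. assumed to be set).
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        model = DDP(model, device_ids=[gpu_id])
    # With DIST_ENABLE=False the plain module is returned and no process
    # group is ever created.
    return model
```

So the wrap itself is fine when `DIST_ENABLE=False`; the problem is any later code that unconditionally calls into `torch.distributed`.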
E.g. with `self.DIST_ENABLE = False` in `configs/default.py` we are going to fail directly at https://github.com/yoxu515/aot-benchmark/blob/1c3a5ec51d81f3e17ff9092aa1e830206d766132/networks/managers/trainer.py#L342:

```
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group....
```
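I guess that line hits a collective call without a `cfg.DIST_ENABLE` guard. Roughly, I would expect something along these lines to be needed (just a hypothetical sketch with a made-up `reduce_value` helper, not a patch against the repo):

```python
import torch.distributed as dist

def reduce_value(value, cfg):
    # Hypothetical guard: collective ops need an initialized process group,
    # so with DIST_ENABLE=False we just return the local tensor instead of
    # hitting "Default process group has not been initialized".
    if cfg.DIST_ENABLE and dist.is_available() and dist.is_initialized():
        dist.all_reduce(value)
        value = value / dist.get_world_size()
    return value
```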