
Improving Hydra+DDP support

Open jgbos opened this issue 3 years ago • 17 comments

What does this PR do?

There are documented issues for Hydra+DDP in #11300 and #2727. This PR attempts to fix these issues by redefining how a process is spawned when Hydra is used.

Fixes #11300

Problem

The current approach is defined here: https://github.com/PyTorchLightning/pytorch-lightning/blob/fe34bf2a653ebd50e6a3a00be829e3611f820c3c/pytorch_lightning/strategies/ddp.py#L233

This PR addresses the issue of running with Hydra multirun. For example, let's say we have the following Hydra app:

import hydra
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel  # Lightning's minimal example model (location may differ across versions)


@hydra.main(config_path=None, config_name="myconfig", version_base="1.1")
def task_fn(cfg):
    trainer = Trainer(gpus=2, strategy="ddp")
    model = BoringModel()
    trainer.fit(model)


if __name__ == "__main__":
    task_fn()

We can execute a multirun job with this app using the following command:

python script.py foo=1,2 --multirun

This command will attempt to launch 2 jobs sequentially: one with foo=1 and one with foo=2. For the first job, foo=1, PL launches the rank-0 process normally and begins execution, while the additional DDP process is spawned via a subprocess using the following command derived from sys.argv:

python script.py foo=1,2 --multirun hydra.run.dir=<os_cwd> hydra.job.name=train_ddp_process_<local_rank>

This will spawn a new multirun job instead of running a normal job with foo=2. The command should instead be:

python script.py foo=2 hydra.run.dir=<os_cwd>
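For illustration, here is a minimal sketch of how the current launcher effectively derives the child command from sys.argv (simplified, not the exact Lightning source), which shows why the full sweep gets replayed:

import os
import sys

local_rank = 1  # example: the first child process spawned by rank 0

# For `python script.py foo=1,2 --multirun`, sys.argv is
# ['script.py', 'foo=1,2', '--multirun'], so this child command re-launches
# the entire sweep instead of continuing the single job currently running.
child_cmd = [sys.executable, os.path.abspath(sys.argv[0])] + sys.argv[1:] + [
    f"hydra.run.dir={os.getcwd()}",
    f"hydra.job.name=train_ddp_process_{local_rank}",
]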

Solution

Every Hydra process saves the reproducible configuration of the job in a config.yaml file located in the hydra.output_subdir experiment directory. Using Hydra's CLI, we can execute the app with the same configuration as the current experiment by passing --config-path (-cp) and --config-name (-cn):

python script.py -cp <path to hydra output dir> -cn config.yaml hydra.run.dir=<os_cwd> 

Here config.yaml contains the value of foo appropriate for the current multirun job. I've outlined multirun support here, but this approach should work for any Hydra application launched from the command line.
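A rough sketch of how such a child command could be assembled (assuming Hydra is initialized in the launching process; the helper name is illustrative, not the PR's actual code):

import os
import sys
from hydra.core.hydra_config import HydraConfig

def _hydra_child_cmd() -> list:
    # Hydra writes the composed config for the current job under
    # <run dir>/<hydra.output_subdir>/config.yaml
    hydra_output = os.path.join(os.getcwd(), HydraConfig.get().output_subdir)
    return [
        sys.executable,
        os.path.abspath(sys.argv[0]),
        "-cp", hydra_output,
        "-cn", "config.yaml",
        f"hydra.run.dir={os.getcwd()}",
    ]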

Lingering Issue Not Solved

In order to run multirun on a local machine, the user must add additional code to their task function before launching the next job. This code, shown below, destroys all distributed processes and removes PL-related environment variables. Without it, the multirun job hangs after the first job.

import os

import hydra
import torch.distributed as dist
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel  # Lightning's minimal example model (location may differ across versions)


@hydra.main(config_path=None, config_name="myconfig", version_base="1.1")
def task_fn(cfg):
    trainer = Trainer(gpus=2, strategy="ddp")
    model = BoringModel()
    trainer.fit(model)
    trainer.test(model)

    # Need to do this in addition to Lightning shutting down the
    # distributed processes in order to run a multirun loop with hydra
    if dist.is_initialized():
        dist.destroy_process_group()

    os.environ.pop("LOCAL_RANK", None)

Thoughts

This solution should help most people, but IMHO the current method of spawning processes from sys.argv is not robust. It would be nice to be able to execute a PL application similarly to how I implemented the Hydra solution. This would require dynamically creating a configuration for the application. The use of save_hyperparameters is nice, but it only happens inside the creation of a LightningModule; it would be nice to create a description of the model at a higher level, outside of the module. By defining the description outside of the module, one could use an approach similar to the Hydra solution above for PL applications. I would recommend taking a look at hydra-zen (FYI, I'm a contributor).
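For a concrete flavor of that idea, here is a minimal hydra-zen sketch that generates a Trainer configuration outside of any module (illustrative only; builds/instantiate are hydra-zen's API, but this wiring is not part of the PR):

from hydra_zen import builds, instantiate
from pytorch_lightning import Trainer

# Dynamically generate a structured config that fully describes the Trainer,
# then re-create the object from that config (e.g. in a spawned process).
TrainerConf = builds(Trainer, gpus=2, strategy="ddp", populate_full_signature=True)
trainer = instantiate(TrainerConf)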

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

Personal TODO:

  • [x] Add test checking processes use the correct parameters when spawned

PL TODO:

  • [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [ ] Did you make sure to update the documentation with your changes? (if necessary)
  • [x] Did you write any new necessary tests? (not for typos and docs)
  • [x] Did you verify new and existing tests pass locally with your changes?
  • [x] Did you list all the breaking changes introduced by this pull request?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

jgbos avatar Jan 25 '22 14:01 jgbos

@tchaton I'm working through some issues with teardown. It appears there are issues with the process hanging on torch.distributed.destroy_process_group(), but if I do it in my task function outside of trainer.fit I do not have any issues. Is there a different spot to put the code currently in teardown?

jgbos avatar Jan 31 '22 23:01 jgbos

@rohitgr7 We could potentially have a hydra specific launcher after #11643 is finalized. In that sense, I like the approach of this PR of creating a utility function encapsulating the hydra command.

awaelchli avatar Feb 10 '22 00:02 awaelchli

@rohitgr7 We could potentially have a hydra specific launcher after #11643 is finalized. In that sense, I like the approach of this PR of creating a utility function encapsulating the hydra command.

@awaelchli I pushed the hydra launcher here. Is this something you had in mind?

rohitgr7 avatar Feb 22 '22 15:02 rohitgr7

Just a heads up to all, I've been on vacation for the past week. I'll look at all the changes and try to respond to the comments this week.

jgbos avatar Mar 01 '22 13:03 jgbos

So I'm attempting to revive this PR. I got everything running, but there are still some outstanding issues to address. The main issues are:

  • Hydra now comes with version_base; we need to handle both 1.1 and 1.2 (see the version-check sketch after this list)
  • The Hydra task function needs to clean up the environment and shut down the distributed processes in order to execute multirun (at least with the basic sweeper and launchers)
  • The unresolved issue raised by @Jasha10 above
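For the version handling, a check like the following could gate the 1.1- vs 1.2-specific code paths (just a sketch; nothing here is the PR's actual code):

from hydra import __version__ as hydra_version
from packaging.version import Version

# Hydra >= 1.2 introduces version_base and the experimental re-run feature;
# older releases need the 1.1-compatible behavior.
HYDRA_GE_1_2 = Version(hydra_version) >= Version("1.2.0")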

Here is the basic task function (see tests/tests_pytorch/strategies/test_ddp_hydra_support.py):

import os

import hydra
import torch
from pytorch_lightning import Trainer
# BoringModelGPU is a helper model defined in Lightning's test suite


@hydra.main(config_path=None, version_base="1.1")
def task_fn(cfg):
    trainer = Trainer(accelerator="auto", devices=cfg.devices, strategy=cfg.strategy, fast_dev_run=True)
    model = BoringModelGPU()
    trainer.fit(model)
    trainer.test(model)

    # the code below is needed for multirun only
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()

    os.environ.pop("LOCAL_RANK", None)

jgbos avatar Aug 12 '22 03:08 jgbos

I added support for Hydra Re-Run (https://hydra.cc/docs/1.2/experimental/rerun/).

I'm trying to figure out why the tests fail in GitHub Actions but pass on my machine. I'm not sure I understand how to set up tests with the latest updates.

jgbos avatar Aug 16 '22 17:08 jgbos

I'm thinking now that _SubprocessScriptLauncher should have a method to compute the subprocess command:

def _subprocess_cmd(self, local_rank: int) -> Sequence[str]:
    ...

Then we could have the base subprocess launcher (no hydra) and the hydra specific launcher. We could then make a separate strategy hydra_ddp.

Any thoughts on this? It would allow users to easily customize the subprocess command.

jgbos avatar Aug 17 '22 01:08 jgbos

@jgbos You chose to get rid of the launcher subclass, right?

We could then make a separate strategy hydra_ddp.

Why would we want this? This should be invisible to users

carmocca avatar Aug 22 '22 14:08 carmocca

Hi @carmocca, thanks for the comments. Regarding the subclassing, let me clarify. I am wondering if we should update _SubprocessScriptLauncher to have a _subprocess_cmd method:


class _SubprocessScriptLauncher(_Launcher):
    ...

    def _subprocess_cmd(self, local_rank: int) -> Sequence[str]:
        if __main__.__spec__ is None:  # pragma: no-cover
            return [sys.executable, os.path.abspath(sys.argv[0])] + sys.argv[1:]
        else:
            return [sys.executable, "-m", __main__.__spec__.name] + sys.argv[1:]

    def _call_children_scripts(self) -> None:
        ...
        for local_rank in range(1, self.num_processes):
            ...
            cmd = self._subprocess_cmd(local_rank)
            subprocess.Popen(cmd, env=env_copy)
            ...

Then we could make specific subprocess launchers that can be easily customized by the user, for instance Hydra:

class _HydraSubprocessScriptLauncher(_SubprocessScriptLauncher):

    def _subprocess_cmd(self, local_rank: int) -> Sequence[str]:
        # return the subprocess command
        ...

Since this is a new launcher we'd have to make a new DDPStrategy:

class HydraDDPStrategy(DDPStrategy):
    strategy_name = "hydra_ddp"

    def _configure_launcher(self) -> None:
        assert self.cluster_environment is not None
        if not self.cluster_environment.creates_processes_externally:
            self._launcher = _HydraSubprocessScriptLauncher(self.cluster_environment, self.num_processes, self.num_nodes)
            self._rank_0_will_call_children_scripts = True

    @classmethod
    def register_strategies(cls, strategy_registry: Dict) -> None:
        strategy_registry.register(
            "hydra_ddp_find_unused_parameters_false",
            cls,
            description="Hydra DDP Strategy with `find_unused_parameters` as False",
            find_unused_parameters=False,
        )
        strategy_registry.register(
            cls.strategy_name,
            cls,
            description=f"{cls.__class__.__name__}",
        )

I like the approach of explicitly defining the strategy that supports Hydra (and other customizable launchers). I haven't pushed this change because I was unsure of PL's opinion on this.

jgbos avatar Aug 22 '22 14:08 jgbos

The subprocess_cmd refactor makes sense to me. But I would definitely avoid the need to introduce a separate strategy. The fact that Hydra needs a custom launch command does not mean it needs a custom strategy. We should select the correct launcher automatically.

cc @awaelchli

carmocca avatar Aug 22 '22 16:08 carmocca

Sounds good, @carmocca. I extracted the subprocess command, but I still have it selected automatically based on whether Hydra was initialized.
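For reference, that "was Hydra initialized" check can be done with Hydra's own API; a minimal sketch of the selection logic (not the exact code in the PR):

from hydra.core.hydra_config import HydraConfig

def _use_hydra_launch_command() -> bool:
    # True only when running inside an initialized Hydra application, in which
    # case the Hydra-aware child command is used instead of the raw sys.argv one.
    return HydraConfig.initialized()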

jgbos avatar Aug 27 '22 05:08 jgbos

Codecov Report

Merging #11617 (0f12e9a) into master (e53c4e8) will increase coverage by 22%. The diff coverage is 38%.

Note: Current head 0f12e9a differs from the pull request's most recent head e2bdb76. Consider uploading reports for commit e2bdb76 to get more accurate results.

@@            Coverage Diff            @@
##           master   #11617     +/-   ##
=========================================
+ Coverage      51%      73%    +22%     
=========================================
  Files         329      332      +3     
  Lines       26640    30808   +4168     
=========================================
+ Hits        13657    22546   +8889     
+ Misses      12983     8262   -4721     

codecov[bot] avatar Aug 27 '22 13:08 codecov[bot]

Great work. @jgbos If you don't mind, could you look at the failing tests here (ignore all other jobs that are failing): https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=93294&view=logs&j=3f274fac-2e11-54ca-487e-194c91f3ae9f&t=fd22e4d0-068c-5651-f467-789f79e7bb7b

IMO we need to make the test utility raise the full stack trace from the subprocess; otherwise, future failures in these tests will be harder to debug.
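Something along these lines in the test utility would surface the child's traceback (the helper name and signature below are hypothetical):

import subprocess

def run_ddp_script(cmd: list) -> None:
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Re-raise with the subprocess's full stderr so CI logs show the
        # actual stack trace instead of just a non-zero exit code.
        raise RuntimeError(f"Subprocess {cmd} failed:\n{result.stderr}")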

awaelchli avatar Sep 01 '22 14:09 awaelchli

@awaelchli Thanks. I agree we need more info from the crash as I'm not sure why it is crashing either. I'll have to figure out how to reproduce this on my machine.

jgbos avatar Sep 06 '22 14:09 jgbos

@jgbos Looks like there are merge conflicts. Could you merge to master and we can see the CI results again?

turian avatar Sep 12 '22 14:09 turian

Are hydra-core 1.1 and 1.2 both necessary to support?

Perhaps consider pinning hydra-core to 1.2.0 for now?

See also: https://github.com/Lightning-AI/lightning/pull/13309/files

turian avatar Sep 12 '22 17:09 turian

@turian I'll double-check, but the requirement hydra-core>=1.0.5,<1.3.0 should work in general. I'll look into updating the rerun tests, as they require version >=1.2.

jgbos avatar Sep 13 '22 19:09 jgbos

Finally merged! Thank you so much for your time @jgbos

carmocca avatar Sep 22 '22 16:09 carmocca

@jgbos @awaelchli @carmocca @rohitgr7 It seems this PR causes some issues.

  1. PYTHONPATH issue. When I use ddp with 2 devices (gpu), the second process cannot access my own module. There is no issue with pl==1.7.7.

     PYTHONPATH=.:$PYTHONPATH  # my cmd

     Traceback (most recent call last):
       File "/home/$username/diskey-scale/diskey/main.py", line 11, in <module>
         from diskey.conf.config import MainConfig
     ModuleNotFoundError: No module named 'diskey'

  2. ddp deadlock. After changing PYTHONPATH to PYTHONPATH=$HOME/diskey-scale:$PYTHONPATH, the ModuleNotFoundError is solved. However, the following error is triggered:

     hydra.errors.ConfigCompositionException: 'config.yaml' is validated against ConfigStore schema with the same name. This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2. In addition, the automatically matched schema contains a defaults list. This combination is no longer supported. See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.

These issues did not happen with hydra-submitit-launcher (with Slurm).

Liangtaiwan avatar Oct 22 '22 15:10 Liangtaiwan

@Liangtaiwan Not sure what is going on there. Could you open a GitHub issue with the exact error, the steps to reproduce, and the environment you run in? Perhaps @jgbos will be able to help too? Thanks!

awaelchli avatar Oct 23 '22 15:10 awaelchli

@Liangtaiwan sorry, I would need to know more about the error. Definitely open an issue and I will try to help. It would be good to know the Hydra version and how you executed the script. Also, this error should not occur when launching on Slurm, since each GPU task is launched via the submission script (i.e., the submitit launcher) instead of this code path.

jgbos avatar Oct 24 '22 13:10 jgbos