[WIP] [RFC] Adds experiment management using Hydra

Open syed-ahmed opened this issue 4 years ago • 19 comments

This PR adds support for launching fpga-tool-perf benchmarks using Hydra.

fpga-tool-perf supports multiple benchmarks for multiple boards for multiple backends. There is a need for a tool that can compose configurations for any valid combination of these groups. In addition, launching them in parallel, running parameter sweeps, using search optimization strategies will become essential in speeding up the benchmarks. Hydra does it all:

  • It can dynamically create a hierarchical configuration by composition.
  • It provides built in support for several different launchers: basic, joblib, slurm, ray etc.
  • It provides built in support for auto-tuning frameworks: optuna, ax, nevergrad etc.

While it provides a lot of benefits, some of the things hydra enforces might be off-putting (although you can really debate about it). For instance:

  • Hydra assumes that you will replace all your argparse code with the hydra configuration. As a result, the command line syntax looks a little different, and the help message will have to be updated every time a new argument is added to the app.
  • You cannot pass a flag like you can in argparse. For instance --list-combinations is changed to list=combinations. The only way to pass a bool would be list-combinations=True which is a bit verbose (but list=combinations doesn't really sound/mean right but looks alright?)

My plan for this PR is to:

  • [x] Refactor fpgaperf.py to be more hydra friendly
  • [ ] Update readme to reflect changes
  • [ ] Merge exhaust.py functionality into fpgaperf.py since fpgaperf.py with hydra achieves the same functionality as exhaust.py
  • [ ] Replace run_config functionality with hydra experiment pattern
  • [ ] Demonstrate parameter sweep/auto-tuning on a cluster

Current Progress

You can run the fpgaperf.py tool locally on multiple cores or on a slurm cluster. I have tested this on a slurm-gcp and on my local machine. I have updated the example commands accordingly in the README.

Please feel free to comment on the PR and raise any concerns you have! I am more than willing to change something if you think things are not developer friendly!

syed-ahmed avatar Apr 13 '21 23:04 syed-ahmed

cc: @mithro

syed-ahmed avatar Apr 13 '21 23:04 syed-ahmed

@mithro Also here's a little more context to the changes:

The majority of the changes are related to changing json to yaml (hydra supports either yaml or python dataclasses (structured config) for composition). The board.json, project jsons, and toolchain dict are separated into their individual yaml files so that Hydra can compose a yaml from the individual yaml files. This is Hydra's model, and based on this separation it can launch the processes in parallel. For instance, when you specify board=arty,nexys project=blinky,oneblink, it will launch four processes, each of which gets one (board, project) pair of configs. In the current code base, I can see how you can achieve the same functionality in exhaust.py, but I thought the fixed cost of the refactor due to hydra is low enough for the built-in features you are getting in return.
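To make the "four processes" arithmetic concrete, here is a minimal standard-library sketch of how a sweep such as board=arty,nexys project=blinky,oneblink expands into individual jobs. It is not Hydra's implementation; `expand_sweep` is a hypothetical helper that only reproduces the Cartesian product that a multirun conceptually performs.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand overrides like 'board=arty,nexys' into one dict per job.

    Hypothetical helper for illustration only; Hydra's own sweeper is
    more general than this.
    """
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # One job per element of the Cartesian product of all value lists.
    return [dict(zip(keys, combo)) for combo in product(*choices)]


jobs = expand_sweep(["board=arty,nexys", "project=blinky,oneblink"])
for number, job in enumerate(jobs):
    print(number, job)
```

Two boards times two projects gives four jobs, and each job dict is what one launched process would receive as its selection of config groups.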

Also here's a small tutorial on Hydra if you want to evaluate it: https://hydra.cc/docs/next/tutorials/basic/your_first_app/simple_cli

syed-ahmed avatar Apr 13 '21 23:04 syed-ahmed

I /think/ the downsides to hydra are potentially deal breakers here. Ideally the default user should not have to understand or know that we are using something like hydra....

mithro avatar Apr 14 '21 03:04 mithro

@syed-ahmed - I would be happy to have a quick video chat to see about the best way to move this forward. I am excited to see progress here!

mithro avatar Apr 14 '21 17:04 mithro

@syed-ahmed - I would be happy to have a quick video chat to see about the best way to move this forward. I am excited to see progress here!

@mithro Sounds good! Sent you an email.

syed-ahmed avatar Apr 14 '21 19:04 syed-ahmed

Closing this for now; I will open an issue (maybe in edalize) for further discussion on distributed runs.

syed-ahmed avatar Apr 15 '21 21:04 syed-ahmed

Thanks so much @omry for the comments! Let me re-open this PR and use it as a prototyping/discussion space. We are currently evaluating a broader use case of a tool like hydra in the EDA community, where https://github.com/olofk/edalize would generate build/task workflows, possibly a DAG of tasks (that's why I asked the question in the hydra channel), and a task runner (which could be hydra + ray, or CWL+airflow, or gusty+hydra+airflow) would execute the workflow.

We had a really good experience using Hydra in our research group (ic.ese.upenn.edu) and are happy with everything it's providing for free. I would love to see this being used in the FOSS EDA community in some form!

syed-ahmed avatar Apr 17 '21 19:04 syed-ahmed

@omry and also thanks for the review! I'll soon follow up on the changes.

syed-ahmed avatar Apr 17 '21 19:04 syed-ahmed

Thanks so much @omry for the comments! Let me re-open this PR and will use this as a prototyping/discussion space. We are currently evaluating a broader use case of a tool like hydra in the EDA community, where https://github.com/olofk/edalize would generate build/task workflows/possibly a DAG of tasks (that's why asked the question in the hydra channel) and a task runner (which could be hydra + ray, or CWL+airflow or gusty+hydra+airflow) would execute the workflow.

So the idea is to use Hydra to compose the config for edalize, which in turn will generate the build steps? This seems like a good match. In particular, I can imagine multirun can be useful to build for multiple targets at the same time etc.

As for the DAG: Hydra launching is simplistic right now and does not support full-fledged DAGs. You can use Hydra to launch the initial application (or multiple applications in multirun), which in turn can launch the DAG using something like Ray. The configuration can still be composed by Hydra; you can either pass individual config nodes as you are launching the DAG, or compose new configs on demand using the Compose API.

We had a really good experience using Hydra in our research group (ic.ese.upenn.edu) and are happy with everything it's providing for free. I would love to see this being used in the FOSS EDA community in some form!

This is really great to hear. I would definitely love to see it getting adopted by more communities with high level of complexity like the EDA community.

omry avatar Apr 17 '21 20:04 omry

Hi @omry,

I think the biggest issue currently with Hydra is the combination of needing to "go all in" and "being opinionated". This makes me super cautious about adopting the project; the benefits need to be large because the risks are large. The fact that @syed-ahmed has had success is the only reason it hasn't been an immediate no.

A project which you have to make a "big bet on" (aka "Go all in on") needs to allow multiple approaches and different opinions because otherwise a user is put in an impossible situation when they end up disagreeing with a particular approach or opinion in only one area.

A project which you can adopt incrementally or is modular is much more able to "be opinionated" because if you end up with a disagreement you don't have to adopt the parts you disagree on.

The fact that "Hydra assumes that you will replace all your argparse code with the hydra configuration" is a red flag to me.

Other things I'm looking at is around the number of current users and success stories. I will take a technically inferior project which has a thriving community and user base over a technically better solution that nobody seems willing to use. It is much harder to solve community problems than technical ones! The fact that Hydra might have decent adoption in the machine learning space is a good sign that I plan to investigate more.

Lastly, a lot of the work on SymbiFlow and related projects like VerilogToRouting is being funded by my employer (@Google) and hence we need solutions that work with Google Cloud Platform as it is easy to provide free resources to projects there.

Happy to look into Hydra more,

Tim '@mithro' Ansell

mithro avatar Apr 18 '21 00:04 mithro

Hi Tim,

There are ways to benefit from Hydra incrementally, but your initial goal of integrating with SLURM - if done via Hydra - requires going all in. Let's break things down a bit:

argparse

Hydra provides the ability to compose a hierarchical configuration dynamically and to override everything in the resulting config object. Hydra uses argparse for some of its own things, to differentiate between what the app needs and what Hydra needs (for example, --multirun to run with multiple configurations, or --cfg to show the resulting config without running the app). Everything else (things not starting with --) belongs to the app.

There are a few ways I can see around it (and I am not sure transparent SLURM launching would work with any of those solutions):

  1. Put argparse on top and compose the config using the Compose API. At this point your application is no longer a Hydra application, but an application using Hydra APIs. You lose launching, sweeping, tab completion and other things, but you keep argparse and you get to compose your config dynamically (although it's your responsibility to interface the command line with the Compose API if you want control over the config from the command line).

  2. Create a second, Hydra-based entry point for your app, keeping the existing one as legacy, and start transitioning use cases. This is the path fairseq took. I only recommend it if the cost of a breaking change is big; it's a harder path.

  3. Wrap the existing CLI with a Hydra app and have the Hydra app execute the current CLI as a subprocess. I would only consider this if any changes to the existing CLI are out of the question. I think this will cause many difficulties.
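A rough sketch of the first option, under stated assumptions: argparse stays on top, and the parsed flags are translated into Hydra-style key=value override strings. The flag names and the `to_overrides` helper are hypothetical, and the sketch stops short of calling Hydra so that it runs with the standard library alone.

```python
import argparse


def build_parser():
    # Flag names are assumptions for illustration, loosely modelled on the
    # options mentioned in this thread.
    parser = argparse.ArgumentParser(prog="fpgaperf.py")
    parser.add_argument("--board", default="arty")
    parser.add_argument("--project", default="blinky")
    parser.add_argument("--list-combinations", action="store_true")
    return parser


def to_overrides(args):
    """Turn parsed argparse values into 'key=value' override strings."""
    overrides = [f"board={args.board}", f"project={args.project}"]
    if args.list_combinations:
        overrides.append("list=combinations")
    return overrides


args = build_parser().parse_args(["--board", "nexys", "--list-combinations"])
overrides = to_overrides(args)
print(overrides)

# In recent Hydra releases this list could then be handed to the Compose API
# inside an initialize() context, roughly:
#   with hydra.initialize(config_path="conf"):
#       cfg = hydra.compose(config_name="config", overrides=overrides)
# The config path and config name above are placeholders.
```

The mapping layer is the part the application has to own with this option: every flag that should influence the config needs an explicit translation.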

Launching

Hydra does not provide an API for launching; it's done via the command line only. However, launcher plugins can be implemented for various backends. We currently have 4 public launcher plugins (Joblib, Submitit (SLURM), Redis Queue, and Ray (AWS and more)). At Facebook we have a second launcher for SLURM using an internal API, and a second launcher for an internal cluster. Google can build the same launchers and host them internally. Here is the example launcher. Implementing those correctly can be a bit tricky, but there is a pretty solid test suite that ensures that the launcher plugins are compliant.

Hydra's launching and sweeping is "the cherry on the cake". About 80% of the value is coming from the flexible config composition. An app can use any launching API directly if it wants to. For SLURM, you can even use sbatch directly but that's not fun at all and generates a lot of boilerplate scripts.

Success stories

Hydra is a new project, I open sourced it in October 2019 (a year and a half ago). It got adopted at Facebook as the configuration platform for most future machine learning projects. It also got significant adoption inside FAIR (Facebook AI Research).

As for public adoption: Hydra crossed 4k stars recently and the number of public GitHub repositories with a dependency on Hydra is over 750. Most of the projects are individual research projects, but there are some major frameworks like fairseq and NVIDIA NeMo that are using it.

It's getting a lot of traction with the deep learning community, which is where it originated.

I totally get the reluctance to go all in on a project you have just heard of (or even a project you have heard of a lot but did not try for yourself). One suggestion I have is to play with it on a new, lower-risk project to get comfortable with it before attempting to port a big project.

omry avatar Apr 18 '21 01:04 omry

Finally, another path to migrate to Hydra incrementally is to use OmegaConf as the internal config object without changing anything else. OmegaConf is the lower level config library powering Hydra. You can get a lot by switching to it before switching to Hydra. This is the path mmf took. They will migrate to Hydra next.

omry avatar Apr 18 '21 01:04 omry

argparse

Hydra provides the ability to compose a hierarchical configuration dynamically and to override everything in the resulting config object.

This is a pretty typical way of doing things like this. Sounds pretty good.

Hydra uses argparse for some of its own things, to differentiate between what the app needs and what Hydra needs (for example, --multirun to run with multiple configurations, or --cfg to show the resulting config without running the app). Everything else (things not starting with --) belongs to the app.

The thing that is confusing me a bit is why hydra needs the command line of the tasks being run to be modified. Ultimately hydra generates a bunch of runs which have a certain set of properties that end up changing the behaviour of each run in some way, right? There doesn't seem to be any reason that these properties could not be passed to subtasks via command line arguments, environment variables, files, or some combination of all these.

Handling the separation between a runner's arguments and a sub-task's arguments has two pretty "common" solutions.

One typical way is to use a "double dash" (--), which normally ends the current tool's argument parsing, leaving the remaining command line as a string which can be given to the internal tasks. I could imagine something like:

hydra_top.py --runner-arg1 --runner-arg2 --cmd-to-run=mytool -- --tool-extra-argument1 --tool-extra-argument2=$HYDRA_VALUE_B

mytool could still be run with mytool --tool-argument1 --tool-argument2="Hello everyone"?

Another option that is even built into argparse is the idea of subcommands, which work like hydra_top.py --runner-arg1 --runner-arg2 --cmd-to-run=mytool run-experiment --tool-extra-argument1 --tool-extra-argument2=$HYDRA_VALUE_B. This is pretty commonly found in build systems too.
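A minimal sketch of the double-dash convention, using only the standard library. The script and flag names come from the hypothetical command line above and do not describe how Hydra works; the split is done by hand so the behaviour does not depend on how argparse treats the separator.

```python
import argparse


def split_runner_and_tool_args(argv):
    """Split argv at the first bare '--'; the rest belongs to the tool."""
    if "--" in argv:
        cut = argv.index("--")
        return argv[:cut], argv[cut + 1:]
    return argv, []


def parse_runner_args(runner_argv):
    # Hypothetical runner options, mirroring the example command line.
    parser = argparse.ArgumentParser(prog="hydra_top.py")
    parser.add_argument("--runner-arg1", action="store_true")
    parser.add_argument("--runner-arg2", action="store_true")
    parser.add_argument("--cmd-to-run", required=True)
    return parser.parse_args(runner_argv)


argv = [
    "--runner-arg1", "--runner-arg2", "--cmd-to-run=mytool",
    "--", "--tool-extra-argument1", "--tool-extra-argument2=B",
]
runner_argv, tool_argv = split_runner_and_tool_args(argv)
runner = parse_runner_args(runner_argv)

# The command a runner could hand to subprocess.run() for each job.
command = [runner.cmd_to_run, *tool_argv]
print(command)
```

With this split the tool's own command line is untouched, which is the property being asked for here.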

Is there some interaction between hydra and the tasks being run that I don't understand?

mithro avatar Apr 18 '21 01:04 mithro

Lastly, a lot of the work on SymbiFlow and related projects like VerilogToRouting is being funded by my employer (@google) and hence we need solutions that work with Google Cloud Platform as it is easy to provide free resources to projects there.

With regard to GCP usage, I have been using the hydra slurm launcher on Google Cloud. Setting up a Slurm cluster using terraform (https://github.com/SchedMD/slurm-gcp) on GCP is pretty straightforward, and by using the slurm launcher in hydra I was able to get rid of all the jinja-generated sbatch scripts I was using! I experimented with the ray launcher on GCP as well and am willing to add a ray_gcp launcher to hydra (it's pretty much the same code as the ray_aws launcher). I believe there is also a "help wanted" for airflow launcher integration (https://github.com/facebookresearch/hydra/issues/221). I have briefly looked at airflow launcher support, but it seems like it would be more work, and I am currently evaluating whether airflow gives something that ray doesn't. Of course, if a Google Cloud Composer dev can add a Cloud Composer airflow launcher, that would be very beneficial for others :)

syed-ahmed avatar Apr 18 '21 01:04 syed-ahmed

I haven't read it yet, but there seems to be some interesting analysis at https://dmtn-025.lsst.io/

mithro avatar Apr 18 '21 02:04 mithro

Hydra provides the ability to compose a hierarchical configuration dynamically and to override everything in the resulting config object.

This is a pretty typical way of doing things like this. Sounds pretty good.

Can you point to another system that provides similar capabilities? I am not aware of any.

The thing that is confusing me a bit is why hydra needs the command line of the tasks being run to be modified?

A Hydra application is launching itself from the same command line used for normal (local) execution. Take a look at the SLURM example here, including the example application (Note that the Submitit related code in the example application is not required to launch to SLURM).

Taking a step back: breaking away from the getopt standard was an early decision. I don't think using -- provides any value beyond "this is how we have always done things". Not supporting it sends a clear signal that Hydra is not compatible with getopt-style argument parsers. Hydra is much more powerful, and trying to provide such compatibility would only limit the design space of what can be done on the command line for no good reason. An app using Hydra to the fullest has to give up managing the command line in exchange for the flexibility Hydra is offering.

Ultimately hydra generates a bunch of runs which have a certain set of properties that end up changing the behaviour of each run in some way right?

@hydra.main takes over many aspects of the application beyond the command line so that it can execute an application on a remote cluster from its own command line. This includes working directory management and Python logging configuration.

Handling the separation between a runner's arguments and a sub-task's arguments has two pretty "common" solutions.

This is not about runners and subtasks. Hydra can configure multiple independent subsystems at the same time (e.g., Python logging, the Hydra Launcher, the Hydra Sweeper, and the app itself, including any subsystems it may have). Those are all handled in the same way. All of them can be composed, and anything can be overridden from the command line.

This is about a very limited set of low level Hydra specific flags that are orthogonal to configuring the app and any subsystem.

Generally speaking, If Hydra is not a good match you should not use it. I am not open to making changes to Hydra for the sake of any individual project - I have to consider the entire eco-system.

I haven't read it yet, but there seems to be some interesting analysis at https://dmtn-025.lsst.io/

Thanks. Out of the three systems there, I am only aware of airflow. Hydra is not an alternative to it (It could interface with it via a Launcher plugin if anyone makes that plugin).

omry avatar Apr 18 '21 02:04 omry

Hydra provides the ability to compose a hierarchical configuration dynamically and to override everything in the resulting config object.

This is a pretty typical way of doing things like this. Sounds pretty good.

Can you point to another system that provides similar capabilities? I am not aware of any.

There is some discussion around the whole topic of configuration languages at https://sre.google/workbook/configuration-design/ and https://sre.google/workbook/configuration-specifics/ -- The knowledge in the sre.google book mostly comes from dealing with an internal Google language which has a lot of the properties you are describing and is both loved and hated at the same time.

Google has a bit of "not invented here" syndrome, so I never really looked at any of the XML-based solutions, which have many similar properties and were heavily used back in the late 1990s / early 2000s.

Other tools which have large configuration systems, like build tools such as Bazel or system configuration tools like Chef / Puppet, frequently end up with systems like this too.

I believe whoever designed OmegaConf has at least had some experience with this topic, as it seems to have a decent set of properties.

This is not about runners and subtasks. Hydra can configure multiple independent subsystems at the same time (e.g., Python logging, the Hydra Launcher, the Hydra Sweeper, and the app itself, including any subsystems it may have). Those are all handled in the same way. All of them can be composed, and anything can be overridden from the command line.

This is about a very limited set of low level Hydra specific flags that are orthogonal to configuring the app and any subsystem.

From what I can see, the Hydra approach is to have tight integration between the job launcher / coordinator / control and the jobs actually being run which requires "taking over" the job. This is a valid design decision and it has both a large number of advantages and a large number of disadvantages.

Generally speaking, If Hydra is not a good match you should not use it.

That is very reasonable and the approach we will be taking and what this pull request is exploring. Start with understanding if Hydra is a good fit and thus if we should consider adopting it.

However, the fact that both a user has to go "all in" on Hydra and Hydra is strongly opinionated does raise the bar for adoption much higher. These two properties mean that even when Hydra is an excellent match if there is even just one area of disagreement it can be impossible to adopt the solution.

We will continue to explore what the options are here and if Hydra's benefits are worth the costs.

All the best with your project!

mithro avatar Apr 18 '21 04:04 mithro

There is some discussion around the whole topic of configuration languages at https://sre.google/workbook/configuration-design/ and https://sre.google/workbook/configuration-specifics/ -- The knowledge in the sre.google book mostly comes from dealing with an internal Google language which has a lot of the properties you are describing and is both loved and hated at the same time.

Thanks, will go over those.

Google has a bit of "not invented here" syndrome, so I never really looked at any of the XML-based solutions, which have many similar properties and were heavily used back in the late 1990s / early 2000s.

Oh, XML is a disaster, let's not go there :). I actually created a cute alternative to XML as a config language called Swush back in 2009. It never got any traction.

Other tools which have large configuration systems, like build tools such as Bazel or system configuration tools like Chef / Puppet, frequently end up with systems like this too. I believe whoever designed OmegaConf has at least had some experience with this topic, as it seems to have a decent set of properties.

My experience with chef/puppet and build systems is that their "config language" gravitates toward becoming a full-fledged programming language, which can make things difficult on many dimensions.

OmegaConf is also my project (https://github.com/omry/omegaconf). The story is that I needed a flexible configuration system for a new research project and could not find anything that fit the bill. I created OmegaConf, and it evolved side by side with the research project. Hydra is a spinoff from that research project, that took the core concepts from it and made it a generic solution with all of the same properties (and more), but in a way that is easy to use for other projects.

This is not about runners and subtasks. Hydra can configure multiple independent subsystems at the same time (e.g., Python logging, the Hydra Launcher, the Hydra Sweeper, and the app itself, including any subsystems it may have). Those are all handled in the same way. All of them can be composed, and anything can be overridden from the command line. This is about a very limited set of low level Hydra specific flags that are orthogonal to configuring the app and any subsystem.

From what I can see, the Hydra approach is to have tight integration between the job launcher / coordinator / control and the jobs actually being run which requires "taking over" the job. This is a valid design decision and it has both a large number of advantages and a large number of disadvantages.

That's a pretty good characterization. Since the config composition can be used in isolation from @hydra.main, one can integrate it with alternative designs. (Building something like this is a significant investment though).

Generally speaking, If Hydra is not a good match you should not use it.

That is very reasonable and the approach we will be taking and what this pull request is exploring. Start with understanding if Hydra is a good fit and thus if we should consider adopting it.

However, the fact that both a user has to go "all in" on Hydra and Hydra is strongly opinionated does raise the bar for adoption much higher. These two properties mean that even when Hydra is an excellent match if there is even just one area of disagreement it can be impossible to adopt the solution.

Hydra is extremely flexible in most areas. It's using the same config composition principles it's preaching, making it easy to change its behavior to fit many scenarios. There are of course some areas where it's not flexible at all (on principle, or just because something is not supported yet). In practice, most people who try it out stick with it even if it's not perfect for them, because the value tradeoff is worth it. Some people end up using straight OmegaConf, as it gives them more control (at the loss of some higher-level features Hydra provides).

We will continue to explore what the options are here and if Hydra's benefits are worth the costs.

Awesome. Happy to answer questions here or on the Hydra chat.

omry avatar Apr 18 '21 04:04 omry

Thanks @omry and @mithro for taking the time for such detailed replies!

syed-ahmed avatar Apr 18 '21 05:04 syed-ahmed