
Simplify installation through optional dependencies

Open arkottke opened this issue 3 years ago • 7 comments

Is your feature request related to a problem? Please describe. The dependency tree is vast. Just to read the h5 files that are produced requires a 2 GB conda environment.

Describe the solution you'd like Add optional dependencies that are only needed for the actual processing.
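For illustration, optional dependencies could be declared as packaging extras. This is a hypothetical sketch (the package and dependency names here are illustrative, not the project's actual metadata):

```toml
[project]
name = "gmprocess"
dependencies = [
    "h5py",   # just enough to read the produced HDF5 files
]

[project.optional-dependencies]
process = [
    "obspy",
    "dask",
    # ...the rest of the heavy processing stack
]
```

With something like this, `pip install gmprocess` would give the small read-only footprint, while `pip install gmprocess[process]` would pull in the full processing environment.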

arkottke avatar Jul 19 '21 23:07 arkottke

I like this idea, and I didn't even know that this was something that was possible. The challenge is that the internal modules are very interconnected, and it will require a heroic refactor to create dependency islands. This is probably an indication that we could have planned out the package structure better.

Another potential obstacle for this is how we use argparse for dynamically generating help from all of the subcommands. Simply running gmrecords -h gives documentation that is generated from each of the subcommand modules, and if each of those is loaded, then I believe that all of their imports are also loaded.

To use your example of loading the HDF file, the full repository is clearly overkill in many cases. One could install just pyasdf and do a lot. But we have a lot of helper functions, especially for parsing metrics and metadata that are not part of the ASDF definition. We could try to find the minimal dependencies to load gmprocess.io.asdf.stream_workspace, but looking at it now, it includes external dependencies like numpy, pyasdf, pandas, impactutils, and mapio. That's a 1 GB conda environment right there.

So I'd like to keep this issue open as aspirational, but I'm not expecting that we'll be able to make progress on it in the short term.

emthompson-usgs avatar Nov 06 '21 05:11 emthompson-usgs

Scratch that. As a trial I re-organized the internal dependencies of download and assemble. It wasn't nearly as hard as I expected, and it makes for much better organized code because these subcommands were unnecessarily intermingled. I'll have to test if the subcommands can run with a limited dependency environment (the argparse issue might put a wrench in that still).

emthompson-usgs avatar Nov 07 '21 00:11 emthompson-usgs

The smaller list of dependencies that should be required to run download is:

```
python
configobj
ruamel.yaml>=0.17.16
pandas>=1.0
pytz
libcomcat>=2.0.13
obspy>=1.2.1
matplotlib>=3.1.0
numpy>=1.21
scipy>=1.7
setuptools-scm>=6.3.2
```

This results in a 962 MB conda environment. When I run `gmrecords download`, I get an error that an import in the assemble module cannot be found:

```
$ gmrecords download -e se609212
Traceback (most recent call last):
  File "/miniconda/envs/test/bin/gmrecords", line 33, in <module>
    sys.exit(load_entry_point('gmprocess', 'console_scripts', 'gmrecords')())
  File "/src/python/groundmotion-processing/gmprocess/bin/gmrecords.py", line 6, in main
    GMrecordsApp().main()
  File "/src/python/groundmotion-processing/gmprocess/apps/gmrecords.py", line 65, in __init__
    self._parse_command_line()
  File "/src/python/groundmotion-processing/gmprocess/apps/gmrecords.py", line 191, in _parse_command_line
    subcommands = {
  File "/src/python/groundmotion-processing/gmprocess/apps/gmrecords.py", line 192, in <dictcomp>
    name: importlib.import_module(name)
  File "/miniconda/envs/test/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/src/python/groundmotion-processing/gmprocess/subcommands/assemble.py", line 8, in <module>
    from dask.distributed import Client, as_completed
ModuleNotFoundError: No module named 'dask'
```

I think this could be solved by re-working how we are getting the subcommand info for argparse. Although I kinda like the way it works currently, I'm open to alternatives.
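One possibility (a sketch with made-up names, not the actual gmprocess code) is to keep each subcommand's help text in a lightweight registry so argparse can build `-h` without importing the implementation modules, and only import the chosen subcommand's module at dispatch time:

```python
import argparse
import importlib

# Hypothetical registry: help text lives here, not in the heavy modules.
SUBCOMMANDS = {
    "download": ("Fetch raw records", "gmprocess.subcommands.download"),
    "assemble": ("Assemble a workspace", "gmprocess.subcommands.assemble"),
}


def build_parser():
    """Build the CLI parser from the registry alone (no heavy imports)."""
    parser = argparse.ArgumentParser(prog="gmrecords")
    subs = parser.add_subparsers(dest="command")
    for name, (help_text, _module_path) in SUBCOMMANDS.items():
        subs.add_parser(name, help=help_text)
    return parser


def dispatch(argv):
    """Import only the implementation module for the chosen subcommand."""
    args = build_parser().parse_args(argv)
    _help, module_path = SUBCOMMANDS[args.command]
    return importlib.import_module(module_path)
```

The trade-off is that help text is maintained separately from the subcommand modules, which loses some of the convenience of the current auto-generated approach.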

I can think of a few alternative ways to solve the fundamental problem of a bloated conda environment:

  1. Figure out how to remove dependencies. This is the most straight-forward but also potentially the most difficult. I don't think we have any dependencies that are easy to replace.
  2. Break up the package into multiple packages that each would have a smaller list of dependencies. This seems like it would be less convenient to work with, though.
  3. Distribute a container with the code already installed so users don't have to deal with the installation.
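Option 3 might look something like this hypothetical Dockerfile sketch (image and file names are illustrative):

```dockerfile
FROM mambaorg/micromamba:latest

# Build the full conda environment once, inside the image.
COPY environment.yml /tmp/environment.yml
RUN micromamba install -y -n base -f /tmp/environment.yml && \
    micromamba clean --all --yes

# Users would then run, e.g.:
#   docker run --rm -v "$(pwd)":/data <image> gmrecords download -e <eventid>
```

This sidesteps the installation problem entirely, though it trades a 2 GB conda environment for a similarly sized image download.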

emthompson-usgs avatar Nov 07 '21 01:11 emthompson-usgs

And if you add dask back into that list, how big is the environment?

arkottke avatar Nov 09 '21 21:11 arkottke

Well, it doesn't really matter, because the dask import error is coming from the "assemble" module, which isn't getting called here and shouldn't be imported. This is an indication that the modules for all of the subcommands are imported, so we'd be left with the full environment unless we decouple the subcommands into their own command line programs.

emthompson-usgs avatar Nov 22 '21 00:11 emthompson-usgs

I think I finally understand how to do this in an effective way, which is to refactor as a native namespace package. We'll also get a lot of benefit from doing this with some of our repositories that are dependencies of this one, so I'm going to do that first.
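For reference, a native namespace package split (PEP 420: no `__init__.py` in the shared top-level directory) might look roughly like this, with illustrative distribution and module names:

```
gmprocess-io/               # lightweight read-only distribution
    gmprocess/              # no __init__.py here
        io/
            asdf/
                stream_workspace.py

gmprocess-processing/       # heavy processing distribution
    gmprocess/              # no __init__.py here either
        subcommands/
            download.py
            assemble.py
```

Both distributions install into the same `gmprocess` import namespace, so users who only need to read workspaces install just the first one.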

emthompson-usgs avatar Mar 15 '22 19:03 emthompson-usgs

In my mind, it would make sense to partition the project into two pieces:

  1. Extension of the ASDF format. We could use either the native namespace package approach or make it a different repository. This would be enough code to provide the interface for the workspace container and could be used by people who want to interact with a previously computed workspace in a read-only manner.
  2. Processing of time series and metrics. This could be further subdivided, but I feel like if you are going to be building workspaces then you probably are going to want to perform each step, and need all of the associated tools.

arkottke avatar Mar 16 '22 15:03 arkottke