doit icon indicating copy to clipboard operation
doit copied to clipboard

Use multiprocess if available

Open tbekolay opened this issue 5 years ago • 6 comments

Windows doesn't have a true fork implementation, so multiprocessing and pickling are limited compared to other platforms. There are also some situations in Linux / Mac OS X where we would like to pickle something, but it does not work without some code changes. This PR makes pickling for mulitprocessing work in more situations by using the multiprocess library, if it's available.multiprocess is a fork of multiprocessing that uses dill instead of pickle.

This PR is set up such that we do not require the user to have multiprocess installed, but if it is installed we'll use it -- except in Windows, where pickling is so limited that it will make things easier to require it. There are a few other routes we could go:

  1. Don't require it in Windows: This makes things simpler for Windows users who don't want to use multiprocessing. Installing the multiprocess package should work on all platforms, but if there's a platform where there are likely to be issues, it's Windows.

  2. Require it for all platforms: This will likely be a net benefit to everyone, but it introduces a new dependency. Also, dill will always be a little bit slower than pickle since it's able to pickle more things. However, if we're not pickling anything too large anyway, performance shouldn't be an issue, and it would be possible to simplify some parts of the codebase by using dill instead of cloudpickle and removing some of the hacks that we currently use to get Windows to pickle things that it normally can't.

I'm happy to modify this PR to implement either option if you want! I've been using doit for a relatively large data science project and it's been a huge help.

tbekolay avatar Aug 04 '20 17:08 tbekolay

Thanks.

  • On runner.py MRunner there is a check if multiprocessing is available. That should be taken into account...

  • In that spirit I guess it would be better to NOT make multiprocess required on Windows.

  • Also it would be nice to consolidate try/catch of the imports. Maybe on "compat.py" file.

  • Thanks for taking care of CHANGES file. Please also mention this in the "install.rst" file.

  • Finally, need to think how to deal this on CI.

schettino72 avatar Aug 05 '20 05:08 schettino72

Require it for all platforms: This will likely be a net benefit to everyone

could you please elaborate

schettino72 avatar Aug 05 '20 05:08 schettino72

It will also be useful on macOS, where spawn became the default in 3.8, and fork doesn’t always work properly.

Kwpolska avatar Aug 05 '20 09:08 Kwpolska

https://github.com/uqfoundation/multiprocess/issues/65

It seems multiprocess package forced a "regression" because one of the features added (Pool) does not work well with "spawn" method. This leads me to think that multiprocess could be accepted as a multiprocessing replacement but should not be installed by default even for Windows/MAC.

schettino72 avatar Sep 05 '20 03:09 schettino72

Hi @schettino72, since I was touching the doit code for #377 I updated this one also to take your feedback into account.

On runner.py MRunner there is a check if multiprocessing is available. That should be taken into account...

Done, I implemented the check in compat.py and call it from MRunner now.

In that spirit I guess it would be better to NOT make multiprocess required on Windows.

Yes, that makes sense. I instead made it an optional dependency that you can install with pip install doit[multiprocess].

Also it would be nice to consolidate try/catch of the imports. Maybe on "compat.py" file.

Done, the imports are done in compat.py now.

Thanks for taking care of CHANGES file. Please also mention this in the "install.rst" file.

Done!

Finally, need to think how to deal this on CI.

I didn't add anything to CI, but you could either modify one of the existing builds in your build matrix to install with multiprocess, by changing

      - run: pip install . -r dev_requirements.txt

to

      - run: pip install .[multiprocess] -r dev_requirements.txt

for that build (might need quotes, like pip install '.[multiprocess]', not sure what kind of shell Github actions uses). Or you could add a new build to the matrix that does the above with the latest version of Python. Those are my ideas anyhow; running the tests locally with multiprocess works correctly.

If you like, I can try to add a test that would fail with multiprocessing but succeed with multiprocess.

tbekolay avatar Nov 16 '21 20:11 tbekolay

If you like, I can try to add a test that would fail with multiprocessing but succeed with multiprocess.

That would be great.

schettino72 avatar Nov 18 '21 08:11 schettino72