Megatron-DeepSpeed

Error in Installation due to circular import

Open lintangsutawika opened this issue 2 years ago • 4 comments

I am trying to run the tests on the codebase. I am using a Docker image on an AWS p3.2xlarge (Tesla V100):

docker pull nvcr.io/nvidia/pytorch:21.10-py3

Running python -m pip install -e . returns the following error:

  ERROR: Command errored out with exit status 1:
   command: /workspace/Megatron-DeepSpeed/megatron/bin/python /workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpw_grs406
       cwd: /workspace/Megatron-DeepSpeed
  Complete output (20 lines):
  Traceback (most recent call last):
    File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 280, in <module>
      main()
    File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 114, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-ijyc1r97/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
      return self._get_build_requires(
    File "/tmp/pip-build-env-ijyc1r97/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-ijyc1r97/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 258, in run_setup
      super(_BuildMetaLegacyBackend,
    File "/tmp/pip-build-env-ijyc1r97/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 150, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 25, in <module>
      from megatron.package_info import (
    File "/workspace/Megatron-DeepSpeed/megatron/__init__.py", line 15, in <module>
      import torch
  ModuleNotFoundError: No module named 'torch'
  ----------------------------------------

Upon further investigation, I changed into the megatron directory and tried import torch, but got this error:

Python 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 19:52:28) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/__init__.py", line 643, in <module>
    from .functional import *  # noqa: F403
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/functional.py", line 6, in <module>
    import torch.nn.functional as F
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear, LazyLinear
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 6, in <module>
    from .. import functional as F
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/nn/functional.py", line 11, in <module>
    from .._jit_internal import boolean_dispatch, _overload, BroadcastingList1, BroadcastingList2, BroadcastingList3
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/_jit_internal.py", line 24, in <module>
    import torch.distributed.rpc
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/distributed/__init__.py", line 52, in <module>
    from .distributed_c10d import *  # noqa: F403
  File "/workspace/Megatron-DeepSpeed/megatron/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3, in <module>
    import logging
  File "/workspace/Megatron-DeepSpeed/megatron/logging.py", line 22, in <module>
    from logging import CRITICAL  # NOQA
ImportError: cannot import name 'CRITICAL' from partially initialized module 'logging' (most likely due to a circular import) (/workspace/Megatron-DeepSpeed/megatron/logging.py)

It seems that megatron/logging.py is interfering with torch's own import of logging?

lintangsutawika • Oct 29 '21 05:10
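
A note on the second traceback: starting the interpreter from inside the megatron directory puts that directory first on sys.path, so torch's import logging resolves to megatron/logging.py instead of the standard library, and that file's own from logging import CRITICAL then ends up importing itself. The sketch below reproduces only the lookup, using a temporary stand-in directory; the resolve_origin helper and the directory are illustrative assumptions, not part of the repository.

```python
import importlib.machinery
import os
import sys
import tempfile


def resolve_origin(module_name, search_path):
    """Return the file that `import module_name` would load for search_path."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return None if spec is None else spec.origin


with tempfile.TemporaryDirectory() as fake_pkg_dir:
    # Hypothetical stand-in for megatron/logging.py.
    with open(os.path.join(fake_pkg_dir, "logging.py"), "w") as fh:
        fh.write("from logging import CRITICAL  # NOQA\n")

    # Normal lookup: the standard-library logging package is found.
    stdlib_origin = resolve_origin("logging", sys.path)

    # Lookup with the stand-in directory first, as when Python is started
    # from inside that directory: the local logging.py is found instead.
    shadowed_origin = resolve_origin("logging", [fake_pkg_dir] + sys.path)

print("standard lookup:", stdlib_origin)
print("shadowed lookup:", shadowed_origin)
```

Starting Python from the repository root, or renaming the module, keeps the standard-library logging first in the lookup.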

I was able to reproduce on a clean env.

It's weird: I fixed the second point (the circular import) by renaming the file to logging_utils.py. However, that didn't fix the installation error, so I think the two issues are unrelated.

Running python setup.py install worked, though. I'm not sure why one fails and the other doesn't.

thomasw21 • Oct 29 '21 14:10

So I tried tackling this again. Turns out pip install -e . --no-use-pep517 works. I'm still unclear why that is.

thomasw21 • Jan 05 '22 01:01

Yeah, there are multiple issues with circular imports in megatron-lm. We fixed some but more are popping up.

And I don't think it was designed to be installed. Remember that it has to compile the CUDA kernels that live in the source tree. DeepSpeed has a way to do that which works well, but Megatron doesn't.

stas00 • Jan 09 '22 01:01

> Turns out pip install -e . --no-use-pep517 works. I'm still unclear why that is.

Can this somehow be enabled automatically in setup.py? That's too hard to remember.

stas00 • Jan 09 '22 01:01
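
On why --no-use-pep517 helps: the first traceback shows setuptools running from /tmp/pip-build-env-ijyc1r97/overlay, which is pip's isolated PEP 517 build environment. Only the declared build requirements are installed there, so torch is most likely not visible when setup.py imports megatron.package_info and thereby executes megatron/__init__.py. One possible way to avoid that from setup.py itself, offered as a sketch and not as the project's actual fix, is to execute package_info.py as a standalone file instead of importing the package. The load_package_info helper, the demo_pkg layout, and the __version__ field below are hypothetical stand-ins.

```python
import os
import runpy
import tempfile


def load_package_info(package_dir):
    """Run package_info.py as a plain script and return its globals.

    This bypasses the package's __init__.py, so heavy imports there
    (torch, in the issue above) are never triggered.
    """
    return runpy.run_path(os.path.join(package_dir, "package_info.py"))


# Demo with a stand-in package whose __init__.py cannot be imported.
with tempfile.TemporaryDirectory() as repo_root:
    package_dir = os.path.join(repo_root, "demo_pkg")
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as fh:
        fh.write("import a_module_that_is_not_installed\n")
    with open(os.path.join(package_dir, "package_info.py"), "w") as fh:
        fh.write('__version__ = "0.0.1"\n')

    # Succeeds even though importing demo_pkg itself would fail.
    info = load_package_info(package_dir)

print(info["__version__"])
```

This only works if package_info.py does not itself import from the package; if it does, the metadata would need to be moved or duplicated somewhere import-free.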