🐛[BUG]: DistributedManager gets silently initialized as a single process job if instantiated before initializing

Open akshaysubr opened this issue 1 year ago • 2 comments

Version

main

On which installation method(s) does this occur?

Source

Describe the issue

This works as expected:

In [1]: from modulus.distributed import DistributedManager

In [2]: DistributedManager.is_initialized()
Out[2]: False

In [3]: DistributedManager.initialize()

In [4]: DistributedManager.is_initialized()
Out[4]: True

In [5]: manager = DistributedManager()

In [6]: manager._initialization_method
Out[8]: 'None'

but this does not:

In [1]: from modulus.distributed import DistributedManager

In [2]: manager = DistributedManager()

In [3]: manager._initialization_method
Out[3]: 'None'

In [4]: manager.is_initialized()
Out[4]: True

Minimum reproducible example

In [1]: from modulus.distributed import DistributedManager

In [2]: manager = DistributedManager()

In [3]: manager._initialization_method
Out[3]: 'None'

In [4]: manager.is_initialized()
Out[4]: True

In [5]: manager.initialize()
/code/modulus-core/modulus/distributed/manager.py:302: UserWarning: Distributed manager is already intialized
  warn("Distributed manager is already intialized")

Relevant log output

No response

Environment details

No response

Apr 25 '24 05:04 akshaysubr

One reason this happens is that the initialization check in DistributedManager is based on the size of DistributedManager._shared_state: https://github.com/NVIDIA/modulus/blob/main/modulus/distributed/manager.py#L194-L197
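To illustrate the failure mode, here is a toy sketch of a Borg-style class with a size-based check. It is not the real DistributedManager code: the class name SizeCheckedBorg and the default attributes are made up for the example, and it assumes that instantiation fills in single-process defaults in the shared state.

```python
class SizeCheckedBorg:
    """Toy Borg-style class (hypothetical, NOT the real DistributedManager)."""

    # One dictionary shared by every instance (the Borg pattern).
    _shared_state = {}

    def __init__(self):
        # All instances use the same attribute dictionary.
        self.__dict__ = self._shared_state
        # Assumption for this sketch: defaults for a single-process job
        # are written into the shared state on first instantiation.
        if not hasattr(self, "_rank"):
            self._rank = 0
        if not hasattr(self, "_world_size"):
            self._world_size = 1

    @classmethod
    def is_initialized(cls):
        # Size-based check: any populated shared state counts as initialized.
        return len(cls._shared_state) > 0


before = SizeCheckedBorg.is_initialized()
SizeCheckedBorg()  # merely constructing an instance populates the shared state
after = SizeCheckedBorg.is_initialized()
print(before, after)
```

In this sketch, simply constructing an instance flips the check, even though no initialize step ever ran.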

This silent initialization can be caught by adding an explicit _is_initialized member to the Borg class and setting it to True only in the initialize method.
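A minimal sketch of that idea, again on a toy class rather than the real manager. The name FlaggedBorg and the _world_size default are hypothetical; only the _is_initialized flag and the initialize method come from the suggestion above.

```python
class FlaggedBorg:
    """Toy Borg-style class with an explicit flag (hypothetical sketch)."""

    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        # Defaults no longer imply initialization; the flag starts False.
        if not hasattr(self, "_is_initialized"):
            self._is_initialized = False
        if not hasattr(self, "_world_size"):
            self._world_size = 1

    @classmethod
    def is_initialized(cls):
        # Read the explicit flag instead of measuring the shared state.
        return cls._shared_state.get("_is_initialized", False)

    @classmethod
    def initialize(cls):
        # The only place where the flag is set to True.
        manager = cls()
        manager._is_initialized = True


FlaggedBorg()  # instantiating early no longer counts as initializing
early = FlaggedBorg.is_initialized()
FlaggedBorg.initialize()
late = FlaggedBorg.is_initialized()
print(early, late)
```

With the flag, an early instantiation still creates the shared defaults, but is_initialized keeps returning False until initialize is actually called.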

Apr 25 '24 05:04 akshaysubr

@tge25 @dallasfoster Would this be a better way to prevent accidental usage of the DistributedManager before it is initialized?

In [1]: from modulus.distributed import DistributedManager

In [2]: DistributedManager.is_initialized()
Out[2]: False

In [3]: manager = DistributedManager()
---------------------------------------------------------------------------
ModulusUninitializedDistributedManagerWarning Traceback (most recent call last)
Cell In[3], line 1
----> 1 manager = DistributedManager()

File /code/modulus-core/modulus/distributed/manager.py:115, in DistributedManager.__init__(self)
    113 def __init__(self):
    114     if not self._is_initialized:
--> 115         raise ModulusUninitializedDistributedManagerWarning()
    116     super().__init__()

ModulusUninitializedDistributedManagerWarning: Instantiating DistributedManager before calling DistributedManager.initialize is not recommended
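For discussion, the behavior in the traceback above could be sketched roughly as follows. This is a self-contained toy, not the actual patch: GuardedManager and UninitializedManagerError are hypothetical stand-ins for DistributedManager and ModulusUninitializedDistributedManagerWarning, and the flag is assumed to live on the class so that it can be checked before any instance exists.

```python
class UninitializedManagerError(Exception):
    """Hypothetical stand-in for the exception shown in the traceback."""


class GuardedManager:
    """Toy Borg-style class that refuses instantiation before initialize()."""

    _shared_state = {}
    _is_initialized = False  # class-level flag, readable without an instance

    def __init__(self):
        # Fail loudly instead of silently becoming a single-process job.
        if not self._is_initialized:
            raise UninitializedManagerError(
                "Instantiating the manager before calling initialize "
                "is not recommended"
            )
        self.__dict__ = self._shared_state

    @classmethod
    def is_initialized(cls):
        return cls._is_initialized

    @classmethod
    def initialize(cls):
        # Assumed setup step; the real method would configure ranks, devices, etc.
        cls._shared_state["_world_size"] = 1
        cls._is_initialized = True


try:
    GuardedManager()
    raised = False
except UninitializedManagerError:
    raised = True

GuardedManager.initialize()
manager = GuardedManager()  # succeeds once initialize() has run
print(raised, manager._world_size)
```

One consequence of this design is that any code path constructing the manager before initialization fails immediately, which makes the ordering bug visible at the call site.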

Apr 25 '24 06:04 akshaysubr