
integrative test test_filesystem_perf_results.py::test_AttentiveFP_results fails with GPU memory error on AzureML NC6sv2 compute SKU

Open · bmoxon opened this issue · 1 comment

Version 1.5.1, Python 3.8.10, conda environment with a python venv installation per the AMPL README.md installation instructions.

test_filesystem_perf_results.py fails on an Azure NC6s_v2 compute instance with a CUDA out-of-memory error. NC6s_v2 is a 1-GPU, 6-core SKU with 16 GB of GPU memory.

This appears to be an issue with garbage collection and/or a failure to release the CUDA cache in one or more of the tests and/or the model_wrapper when everything runs within a single pytest process. Running test_AttentiveFP_results standalone, or with the --forked flag (using the pytest-forked plugin), succeeds.
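
As a possible mitigation when all tests run in one process, something like the conftest.py fixture below could force garbage collection and release PyTorch's cached CUDA blocks between tests. This is a minimal sketch, not part of AMPL; the fixture name is hypothetical and it assumes torch is importable in the test environment.

```python
# conftest.py -- hypothetical per-test GPU cleanup, not part of AMPL
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def release_cuda_memory():
    """After each test, drop unreferenced model objects and return
    PyTorch's cached CUDA allocator blocks to the driver."""
    yield
    gc.collect()  # collect reference cycles still holding DeepChem/torch models
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # free cached blocks so the next test starts clean
```

Whether this is sufficient depends on what is actually holding the GPU memory; if live references to the models survive each test, empty_cache() alone will not reclaim them.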

The 3 cases below illustrate the behavior:

(1) With a default pytest invocation (all 8 tests in test_filesystem_perf_results.py run in a single process, from the compare_models/ directory), the test run "hangs" ...

(2) When pytest is run with the -x flag, the test fails with a CUDA out-of-memory error, as shown below.

(3) When pytest is run with --forked (using the pytest-forked plugin), all tests pass, as shown below.

# standard pytest invocation
(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ...F<hangs here>
# pytest -x invocation
(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest -x
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ...F

============================================== FAILURES ==============================================
______________________________________ test_AttentiveFP_results ______________________________________

    def test_AttentiveFP_results():
        clean()
        H1_curate()
        json_f = 'jsons/reg_config_H1_fit_AttentiveFPModel.json'
    
>       df1, df2, model_info = all_similar_tests(json_f, 'H1')

test_filesystem_perf_results.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_filesystem_perf_results.py:107: in all_similar_tests
    train_and_predict(json_f, prefix=prefix)
../delaney_Panel/test_delaney_panel.py:192: in train_and_predict
    model.train_model()
../../../pipeline/model_pipeline.py:573: in train_model
    self.model_wrapper = model_wrapper.create_model_wrapper(self.params, self.featurization, self.ds_client)
../../../pipeline/model_wrapper.py:221: in create_model_wrapper
    return PytorchDeepChemModelWrapper(params, featurizer, ds_client)
../../../pipeline/model_wrapper.py:2330: in __init__
    self.model = self.recreate_model()
../../../pipeline/model_wrapper.py:2355: in recreate_model
    model = chosen_model(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/attentivefp.py:276: in __init__
    super(AttentiveFPModel, self).__init__(
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/deepchem/models/torch_models/torch_model.py:198: in __init__
    self.model = model.to(device)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:907: in to
    return self._apply(convert)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:578: in _apply
    module._apply(fn)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:601: in _apply
    param_applied = fn(param)
../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/nn/modules/module.py:905: in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def _lazy_init():
        global _initialized, _queued_calls
        if is_initialized() or hasattr(_tls, 'is_initializing'):
            return
        with _initialization_lock:
            # We be double-checked locking, boys!  This is OK because
            # the above test was GIL protected anyway.  The inner test
            # is for when a thread blocked on some other thread which was
            # doing the initialization; when they get the lock, they will
            # find there is nothing left to do.
            if is_initialized():
                return
            # It is important to prevent other threads from entering _lazy_init
            # immediately, while we are still guaranteed to have the GIL, because some
            # of the C calls we make below will release the GIL
            if _is_in_bad_fork():
                raise RuntimeError(
                    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                    "multiprocessing, you must use the 'spawn' start method")
            if not hasattr(torch._C, '_cuda_getDeviceCount'):
                raise AssertionError("Torch not compiled with CUDA enabled")
            if _cudart is None:
                raise AssertionError(
                    "libcudart functions unavailable. It looks like you have a broken build?")
            # This function throws if there's a driver initialization error, no GPUs
            # are found or any other error occurs
>           torch._C._cuda_init()
E           RuntimeError: CUDA error: out of memory
E           CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

../../../../../../.pythonenvs/atomsci/lib/python3.8/site-packages/torch/cuda/__init__.py:216: RuntimeError
---------------------------------------- Captured stdout call ----------------------------------------
num_model_tasks is deprecated and its value is ignored.
========================================== warnings summary ==========================================
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/featurization.py:1730: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
    calc_smiles_feat_df[col] = calc_desc_df[col]

integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_RF_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
integrative/compare_models/test_filesystem_perf_results.py::test_XGB_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/transformations.py:255: RuntimeWarning: invalid value encountered in true_divide
    X = np.nan_to_num((X - self.X_means) * X_weight / self.X_stds)

integrative/compare_models/test_filesystem_perf_results.py::test_NN_results
  /home/azureuser/localfiles/AMPL/atomsci/ddm/pipeline/model_wrapper.py:2592: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    chkpt_dict = yaml.load(chkpt_in.read())

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================== short test summary info =======================================
FAILED test_filesystem_perf_results.py::test_AttentiveFP_results - RuntimeError: CUDA error: out of...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================== 1 failed, 3 passed, 9 warnings in 319.56s (0:05:19) =========================
# pytest --forked invocation

(atomsci) (azureml_py38_PT_TF) azureuser@nc6sv2-dev2:~/localfiles/AMPL/atomsci/ddm/test/integrative/compare_models$ pytest --forked
======================================== test session starts =========================================
platform linux -- Python 3.8.10, pytest-6.2.1, py-1.11.0, pluggy-0.13.1
rootdir: /home/azureuser/localfiles/AMPL/atomsci/ddm/test, configfile: pytest.ini
plugins: forked-1.6.0
collected 8 items                                                                                    

test_filesystem_perf_results.py ........                                                       [100%]

=================================== 8 passed in 443.37s (0:07:23) ====================================
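
For reference, the per-test process isolation that --forked provides can also be approximated explicitly with a 'spawn' subprocess (the start method the PyTorch error message above refers to), so that all GPU memory is released when the child process exits. This is a hypothetical sketch, not AMPL or pytest-forked code, and it requires the callable to be picklable (defined at module level).

```python
# hypothetical helper: run a CUDA-using callable in an isolated 'spawn' process
import multiprocessing as mp


def _child(fn, queue, args, kwargs):
    try:
        queue.put(("ok", fn(*args, **kwargs)))
    except Exception as exc:  # report the failure back to the parent
        queue.put(("error", repr(exc)))


def run_isolated(fn, *args, **kwargs):
    """Run fn in a fresh spawned process; its CUDA context and all GPU
    memory are torn down when the process exits."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(fn, queue, args, kwargs))
    proc.start()
    status, payload = queue.get()  # blocks until the child reports a result or an error
    proc.join()
    if status == "error":
        raise RuntimeError(f"isolated call failed: {payload}")
    return payload
```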

bmoxon · Apr 05 '23 23:04