recommenders icon indicating copy to clipboard operation
recommenders copied to clipboard

Resolve issue #2018

Open SimonYansenZhao opened this issue 1 year ago • 10 comments

Description

This PR resolves the issue #2018 by commenting out the tests in tests/ci/azureml_tests/test_groups.py to disable testing for deeprec.

Related Issues

  • #2018

References

Checklist:

  • [X] I have followed the contribution guidelines and code style for this project.
  • [ ] I have added tests covering my contributions.
  • [ ] I have updated the documentation accordingly.
  • [X] This PR is being made to staging branch and not to main branch.

SimonYansenZhao avatar Oct 17 '23 03:10 SimonYansenZhao

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (https://github.com/recommenders-team/recommenders/pull/2022/commits/88e43a5f112c411cf6801b356ee690f2a0ab5cb4#) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6 is removed from Keras. Need to find a way to replace it.

FYI @miguelgfierro @anargyri

SimonYansenZhao avatar Oct 17 '23 10:10 SimonYansenZhao

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (88e43a5) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in

https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6

is removed from Keras. Need to find a way to replace it. FYI @miguelgfierro @anargyri

It looks like this class is deprecated in the new versions; if I recall, it was one of the temporary fixes when we upgraded from v1 to v2. Now you would have to rewrite the sum cell in the new syntax using this, which may not be straightforward. Maybe we should migrate the entire notebook to PyTorch. @Leavingseason what do you think?

That said, I see the code is still in TF master https://github.com/tensorflow/tensorflow/tree/r2.14/tensorflow/python/keras/layers/legacy_rnn They stopped exposing the module but it is still used internally e.g. here I wonder if copying and pasting the legacy_rnn code in our repo would work (provided that the TF license allows for this).

It looks like the conda env is installing 2.13. How about restricting to < 2.13 or < 2.12, do you still get the error? If not, we could restrict the TF version from above.

anargyri avatar Oct 17 '23 12:10 anargyri

@SimonYansenZhao I checked locally with different versions of TF and this import from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell only works up to (and including) version 2.8.

anargyri avatar Oct 17 '23 16:10 anargyri

We need to discuss a bit more how to proceed, it is not clear to me what is the best way forward. For now, let's bound TF with < 2.9.

anargyri avatar Oct 17 '23 16:10 anargyri

pinged Jianxun on teams

miguelgfierro avatar Oct 18 '23 10:10 miguelgfierro

Hi, seen from SimonYansenZhao's discovery, when the TF>=2.9, the reference in the file recommenders/models/deeprec/models/sequential/sum_cells.py, from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell becomes invalid, which causes the XDeepFMModel to fail to import and run.

  • Dive into LayerRNNCell

Upon examining the LayerRNNCell and RNNCell classes in the tf.keras.layers.legacy_rnn.rnn_cell_impl file, I found that LayerRNNCell simply inherits from the RNNCell class. When calling, it reduces one step of building scope compared to RNNCell, so we can directly use "from keras.layers.legacy_rnn.rnn_cell_impl import RNNCell" as the parent class of SUM_Cell.

  • Alternative for LayerRNNCell

In TF>=2.9, the RNNCell class has been deprecated, but it has been rewritten as AbstractRNNCell in tf.keras.layers.recurrent with similar implementation and comments.

  • Suggested solution

Try replace "from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell" with: try: from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell except: from keras.layers.recurrent import AbstractRNNCell as LayerRNNCell to see whether it works.

Master-PLC avatar Oct 20 '23 03:10 Master-PLC

I got the following error when adding from keras.layers.rnn import AbstractRNNCell as LayerRNNCell. It looks like that AbstractRNNCell does not have _reuse. @Master-PLC

Screenshot 2024-01-22 at 22 40 36

SimonYansenZhao avatar Jan 22 '24 15:01 SimonYansenZhao

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

SimonYansenZhao avatar Jan 31 '24 12:01 SimonYansenZhao

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

Yes, because the latest TF requires cuda 12, previous ones required cuda 11 or less. See https://www.tensorflow.org/install/source#gpu

anargyri avatar Jan 31 '24 17:01 anargyri

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

miguelgfierro avatar Jan 31 '24 21:01 miguelgfierro

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

I think it will work. Only the NVIDIA GPU Driver in the CUDA installed in the Docker image is used when testing, because a new Conda env will be created with a CUDA required by Recommenders, the CUDA in this Conda env will not conflict with the CUDA in the Docker image. The CUDA installed in the Conda env will interact with the driver which is back compatible.

SimonYansenZhao avatar Feb 19 '24 15:02 SimonYansenZhao