
[Help/Bug] Issues with rl_trainer

Open davidjbuerger opened this issue 4 years ago • 0 comments

Description

train_rl fails with a different error depending on the TensorFlow version. The Python version (tested with 3.7 and 3.8) does not seem to matter. A traceback for each TensorFlow version (2.2.0 and 2.3.0) is posted below. I could not get rl_trainer to work, not even as shown in the docs. I'd appreciate some help, even if this is not a trax-related issue. Cheers :)

...

Environment information

OS: Manjaro Linux, Release 20.0.3

$ pip freeze | grep trax
trax==1.3.4

$ pip freeze | grep tensor
mesh-tensorflow==0.1.16
tensor2tensor==1.15.7
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-addons==0.10.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.2.0
tensorflow-gan==2.0.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.22.2
tensorflow-probability==0.7.0
tensorflow-text==2.3.0

And:
mesh-tensorflow==0.1.16
tensor2tensor==1.15.7
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.0rc0
tensorflow-addons==0.10.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.3.0
tensorflow-gan==2.0.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.22.2
tensorflow-probability==0.11.0
tensorflow-text==2.3.0


$ pip freeze | grep jax
jax==0.1.75
jaxlib==0.1.52


$ python -V
Python 3.7.8

And:
Python 3.8.3

For bugs: reproduction and error logs

# Steps to reproduce:
from trax.rl_trainer import *
train_rl(output_dir='./models/acrobot', train_batch_size=32, eval_batch_size=32, n_epochs=1)

...
# Error logs:
Case 1: tensorflow==2.2.0:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from trax import trainer_flags
  File "venv/lib/python3.7/site-packages/trax/__init__.py", line 18, in <module>
    from trax import data
  File "venv/lib/python3.7/site-packages/trax/data/__init__.py", line 20, in <module>
    from trax.data import tf_inputs
  File "venv/lib/python3.7/site-packages/trax/data/tf_inputs.py", line 28, in <module>
    from t5.data import preprocessors as t5_processors
  File "venv/lib/python3.7/site-packages/t5/__init__.py", line 17, in <module>
    import t5.data
  File "venv/lib/python3.7/site-packages/t5/data/__init__.py", line 17, in <module>
    import t5.data.mixtures
  File "venv/lib/python3.7/site-packages/t5/data/mixtures.py", line 22, in <module>
    import t5.data.tasks  # pylint: disable=unused-import
  File "venv/lib/python3.7/site-packages/t5/data/tasks.py", line 21, in <module>
    from t5.data.utils import Feature
  File "venv/lib/python3.7/site-packages/t5/data/utils.py", line 31, in <module>
    from t5.data import sentencepiece_vocabulary
  File "venv/lib/python3.7/site-packages/t5/data/sentencepiece_vocabulary.py", line 25, in <module>
    import tensorflow_text as tf_text
  File "venv/lib/python3.7/site-packages/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import metrics
  File "venv/lib/python3.7/site-packages/tensorflow_text/python/metrics/__init__.py", line 20, in <module>
    from tensorflow_text.python.metrics.text_similarity_metric_ops import *
  File "venv/lib/python3.7/site-packages/tensorflow_text/python/metrics/text_similarity_metric_ops.py", line 28, in <module>
    gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
  File "venv/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: venv/lib/python3.7/site-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so: undefined symbol: _ZN10tensorflow6StatusC1ENS_5error4CodeEN4absl14lts_2020_02_2511string_viewE

Process finished with exit code 1
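
A possible reading of Case 1: the environment above pins tensorflow-text==2.3.0 next to tensorflow==2.2.0, and an undefined-symbol error while loading a compiled op library is what a mismatch between those two packages typically looks like. The sketch below only compares the major.minor parts of the two pins taken from the pip freeze output in this report. The helper names (parse_freeze, major_minor, text_matches_tf) are made up for illustration, and the assumption that tensorflow-text must share its major.minor with tensorflow should be confirmed against the tensorflow-text release notes.

```python
def parse_freeze(text):
    """Turn `pip freeze` style lines ('name==version') into a dict."""
    pins = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip lines that are not 'name==version' pins
            pins[name.lower()] = version
    return pins


def major_minor(version):
    """Return (major, minor) of a version string such as '2.3.0rc0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def text_matches_tf(pins):
    """Assumed rule: tensorflow-text and tensorflow share major.minor."""
    return major_minor(pins["tensorflow"]) == major_minor(pins["tensorflow-text"])


# Pins copied from the two environments listed in this report.
FREEZE_CASE_1 = """\
tensorflow==2.2.0
tensorflow-text==2.3.0
"""
FREEZE_CASE_2 = """\
tensorflow==2.3.0rc0
tensorflow-text==2.3.0
"""

print("case 1 consistent:", text_matches_tf(parse_freeze(FREEZE_CASE_1)))
print("case 2 consistent:", text_matches_tf(parse_freeze(FREEZE_CASE_2)))
```

If the assumption holds, only the second environment has matching pins, which would fit the import succeeding there and failing in Case 1.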


Case 2: tensorflow==2.3.0:

2020-08-04 10:51:16.285107: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
venv/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:68: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.2.0 and strictly below 2.3.0 (nightly versions are not supported). 
 The versions of TensorFlow you are currently using is 2.3.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
  UserWarning,
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    train_rl(output_dir='./models/acrobot', train_batch_size=16, eval_batch_size=16, n_epochs=1)
  File "venv/lib/python3.7/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "venv/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "venv/lib/python3.7/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "venv/lib/python3.7/site-packages/trax/rl_trainer.py", line 99, in train_rl
    tf_np.set_allow_float64(FLAGS.tf_allow_float64)
  File "venv/lib/python3.7/site-packages/absl/flags/_flagvalues.py", line 491, in __getattr__
    raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --tf_allow_float64 before flags were parsed.
  In call to configurable 'train_rl' (<function train_rl at 0x7f6383869b90>)

Process finished with exit code 1
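
For Case 2, the traceback shows an absl flag being read before absl has parsed the command line. The sketch below reproduces that behaviour in isolation, without trax: it defines a hypothetical flag (demo_allow_float64, a stand-in for --tf_allow_float64), shows that reading it before parsing raises UnparsedFlagAccessError, and then parses the flags explicitly. Whether calling FLAGS(...) before train_rl is enough to make rl_trainer run outside of app.run is an assumption, not something verified against trax 1.3.4.

```python
import sys

from absl import flags

FLAGS = flags.FLAGS

# Hypothetical stand-in for the --tf_allow_float64 flag from the traceback.
flags.DEFINE_bool("demo_allow_float64", False, "Illustrative flag only.")

# Reading a flag before parsing raises the error seen in Case 2.
raised_before_parse = False
try:
    _ = FLAGS.demo_allow_float64
except flags.UnparsedFlagAccessError:
    raised_before_parse = True

# Parse explicitly. Passing only the program name keeps all flag defaults;
# passing the full sys.argv would pick up real command-line flags instead.
FLAGS(sys.argv[:1])

print("raised before parse:", raised_before_parse)
print("value after parse:", FLAGS.demo_allow_float64)
```

In a script, the usual alternative is to put the call inside a main(argv) function and start it with absl.app.run(main), which parses the flags before main runs.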

...

davidjbuerger, Aug 04 '20 09:08