
Move training and evaluation related functions from module functions to member functions

chunyang-wen opened this issue 5 years ago • 4 comments

Currently, in order to train models, the ElasticDL training worker requires the following command line parameters:

  • model_def: either of the following:
    • The name of a class that inherits from tf.keras.Model, such as mnist_module.MnistModel.
    • The name of a function that returns a tf.keras.Model instance, such as mnist_module.create_model.
  • optimizer: A function that returns a TensorFlow optimizer, such as tf.train.GradientDescentOptimizer.
  • loss: A function that returns a tensor which represents the loss.
  • eval_metrics_fn: A function that returns a dict. The key of the dict is the metric's name and the value is the metric's corresponding tensor.
  • dataset_fn: A function that returns a tf.data.Dataset. It is the input of training.

Those parameters are the names of functions in the module where we define model_def. ElasticDL calls these functions at different stages of training and evaluation, and they are all defined as module-level functions.
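For concreteness, here is a minimal sketch of what such a module might look like; the module name mnist_module, the layer choices, and the function bodies are illustrative, not ElasticDL's actual model zoo code:

import tensorflow as tf

# mnist_module.py -- a hypothetical model definition module.

class MnistModel(tf.keras.Model):
    def __init__(self):
        super(MnistModel, self).__init__()
        self._dense = tf.keras.layers.Dense(10)

    def call(self, inputs, training=False):
        return self._dense(inputs)

def optimizer(lr=0.1):
    return tf.train.GradientDescentOptimizer(lr)

def loss(outputs, labels):
    return tf.losses.sparse_softmax_cross_entropy(labels, outputs)

def eval_metrics_fn(predictions, labels):
    return {"accuracy": tf.metrics.accuracy(labels, predictions)}

def dataset_fn(files):
    return tf.data.TFRecordDataset(files)

Defining these as loose module-level functions has the following limitations: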

  • It is hard to share information between functions unless we use global variables.
  • It is hard to find out that one of those functions is missing until ElasticDL calls it, and by then it is too late: the training or evaluation process may have been running for a long time.

As a result, we prefer a unified class to encapsulate them. Let's call it TrainSpec.

from abc import ABC, abstractmethod


class TrainSpec(ABC):
    def __init__(self, context=None):
        """
        Args:
            context: a dict of contextual information from ElasticDL,
                such as worker_id and master IP.
        """
        self._context = context

    @abstractmethod
    def create_optimizer(self, lr=0.1):
        """Returns a TensorFlow optimizer such as `tf.train.GradientDescentOptimizer`."""

    @abstractmethod
    def create_loss(self, outputs, labels):
        """Returns a tensor that represents the loss computed from outputs and labels."""

    @abstractmethod
    def create_metrics(self,
                       mode=Mode.TRAINING,
                       outputs=None,
                       labels=None,
                       predictions=None):
        """Returns a dict of metrics."""

    @abstractmethod
    def create_model(self):
        """Returns the created model.

        Users can use the functional API or subclassing to create a
        customized Keras model.
        """

class MnistTrainSpec(TrainSpec):
    def create_optimizer(self, lr=0.1):
        return tf.train.GradientDescentOptimizer(lr)
    # Other functions are left undefined on purpose.

Users define MnistTrainSpec, which inherits from TrainSpec. This gives us at least two benefits:

  • We can share information between functions because they are all member functions of the same class.
  • We can use abstract methods to surface required functions that users have not implemented. Python raises a TypeError the first time we try to instantiate the incomplete MnistTrainSpec, as the snippet below shows.
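For example (the exact error message varies by Python version):

spec = MnistTrainSpec()
# TypeError: Can't instantiate abstract class MnistTrainSpec
# with abstract methods create_loss, create_metrics, create_model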

Another design of TrainSpec

The above TrainSpec assumes that the different create_* functions are isolated. If we need to share information between those functions, we have to put it in the constructor. Another drawback is that multiple calls to the same create_* function may return different instances; for example, create_model returns two instances of tf.keras.Model if we call it twice.
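A quick sketch of the second drawback, where FullTrainSpec is a hypothetical subclass that implements every create_* method:

spec = FullTrainSpec()
model_a = spec.create_model()
model_b = spec.create_model()
assert model_a is not model_b  # two distinct tf.keras.Model instances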

Why do we run into this dilemma? Because we have done too much work for users. Users have a better understanding of their own model, loss, and optimizer; they know when to create an optimizer and what information to share between functions such as the loss and the model.

All ElasticDL cares about is the results:

  • the model instance
  • the optimizer instance
  • the loss function
  • the dataset function
  • the metric function

So we redefine TrainSpec as an abstract base class with only abstract properties. If we want to train an MNIST model, we provide a class MnistSpec that implements all the abstract properties. Users can use the functional API or subclassing to implement those properties and return the corresponding objects.

import abc


class TrainSpec(abc.ABC):
    def __init__(self, context=None):
        """
        Args:
            context: a dict of contextual information from ElasticDL,
                such as worker_id and master IP.
        """
        self._context = context or {}

    @property
    @abc.abstractmethod
    def optimizer(self):
        """Returns an instance of a TensorFlow optimizer, such as
        `tf.train.GradientDescentOptimizer`."""

    @property
    @abc.abstractmethod
    def loss_fn(self):
        """Returns a loss function that takes `outputs` and `labels`, e.g.

        def loss_fn(outputs, labels):
            return tf.losses.mean_squared_error(labels, outputs)
        """

    @property
    @abc.abstractmethod
    def model(self):
        """Returns an instance of `tf.keras.Model` or of one of its subclasses."""

    @property
    @abc.abstractmethod
    def dataset_fn(self):
        """Returns a function that generates a `tf.data.Dataset` instance, e.g.

        def dataset_fn(files):
            # build the dataset from the input files
            return dataset
        """

    @property
    @abc.abstractmethod
    def metrics_fn(self):
        """Returns a metric function, e.g.

        def metrics_fn(mode=Mode.TRAINING,
                       outputs=None,
                       labels=None,
                       predictions=None):
            if mode == Mode.EVALUATION:
                return {"accuracy": tf.metrics.accuracy(labels, predictions)}
            return {}
        """


class MnistModel(tf.keras.Model):
    def call(self, inputs, training=False):
        pass


def _loss_fn(outputs, labels):
    return tf.losses.mean_squared_error(labels, outputs)


class MnistSpec(TrainSpec):
    def __init__(self, context=None):
        super(MnistSpec, self).__init__(context=context)
        self._optimizer = tf.train.GradientDescentOptimizer(0.1)
        self._model = MnistModel()

    @property
    def optimizer(self):
        return self._optimizer

    @property
    def model(self):
        return self._model

    @property
    def loss_fn(self):
        return _loss_fn
    # Other properties are left undefined on purpose.

Users can create the model, optimizer, loss_fn, and dataset_fn anywhere. The code above creates the objects that the properties return directly in MnistSpec's constructor.

This new design gives us the same instance even if we access a property multiple times.
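A quick check of that guarantee (a sketch, assuming MnistSpec also implements the remaining abstract properties so it can be instantiated):

spec = MnistSpec(context={"worker_id": 0})
assert spec.model is spec.model          # always the same tf.keras.Model
assert spec.optimizer is spec.optimizer  # always the same optimizer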

chunyang-wen avatar Aug 08 '19 10:08 chunyang-wen

I agree that those arguments are neither ideal nor scalable. I remember they were suggested by the high-level API design here so that users can configure different models, inputs, optimizers, etc. cc @wangkuiyi @ywskycn @skydoorkai

terrytangyuan avatar Aug 08 '19 13:08 terrytangyuan

I am trying to understand https://github.com/wangkuiyi/elasticdl/issues/1006#issue-478372383. The following is my understanding.

Currently, the function elasticdl.train requires the following parameters:

  • model_def: the model definition, either a tf.keras.Model-derived class or a function that returns the input and output layers,
  • optimizer: a function that returns a TensorFlow optimizer,
  • loss: a function that returns the loss value given a minibatch of outputs and truth,
  • eval_metrics: a function that (I don't know eval_metrics_fn and have no idea how to describe it), and
  • dataset: a function that creates a tf.data.Dataset instance.

The above bullets form the settings of a training job. This inspires us to give them a single name and pass them as a single parameter to elasticdl.train.

We can define the bullets as classes and/or functions in a module and refer to them by the module name. However, in some cases we might need variables for sharing or exchanging information between these bullets. With a module, we'd have to define those variables as global variables in it, and global variables are often problematic.

An alternative is to define the bullets as nested-classes and/or methods of a class, so we can define the variables as class members. We prefer this way.

To better describe the idea, let us look at an example. Suppose that we want to train a model defined as a tf.keras.Model-derived class EasyAndHappy on the MNIST dataset. We can define the following class:

class EasyAndHappyTrainerWithMNIST(elasticdl.Training):
    class EasyAndHappy(tf.keras.Model):
        def __init__(self):
            super().__init__()
            create_some_parameters_here()

        def call(self, inputs, training=False):
            return do_something_here(inputs)

    def dataset(self):
        train, test = tf.keras.datasets.mnist.load_data()
        return train

    def cost(self, outputs, labels):
        return some_cost(outputs, labels)

wangkuiyi avatar Aug 22 '19 08:08 wangkuiyi

@wangkuiyi Just a nit pick here: dataset() should probably be separated from the model.

terrytangyuan avatar Aug 22 '19 14:08 terrytangyuan

Is it necessary to create a separate Spec for ElasticDL?

The short answer is yes.

First, let's review the example from ElasticDL's model zoo that we mentioned during the last meeting: model_zoo/deepfm_functional_api/deepfm_functional_api.py.

In that module, we need to create a global variable named AUC_metric in order to share information between our model and our metric function; the metric function uses AUC_metric to calculate AUC. Global variables are often a sign of bad design.
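A simplified sketch of that pattern (not the exact model zoo code; the function below is illustrative):

import tensorflow as tf

# Module-level global shared between the model code and the metric function.
AUC_metric = tf.keras.metrics.AUC()

def eval_metrics_fn(predictions, labels):
    # Relies on the global AUC_metric to accumulate state across batches.
    AUC_metric.update_state(labels, predictions)
    return {"auc": AUC_metric.result()}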

AUC_metric is an instance of tf.keras.metrics.AUC. It has three member functions that manipulate its data:

  • update_state: Update its internal state (the variables this metric creates).
  • reset_states: Reset its internal state.
  • result: Get the final result of this metric.

After we create an instance of tf.keras.metrics.AUC, we continuously update its state and finally call result() to get the final result. So we have to reuse the same instance throughout a single evaluation (one complete pass over the evaluation dataset).
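In other words, the intended life cycle of a streaming metric looks like this (a sketch; evaluation_batches is a stand-in for the evaluation dataset):

auc = tf.keras.metrics.AUC()
auc.reset_states()                         # start of an evaluation round
for labels, predictions in evaluation_batches:
    auc.update_state(labels, predictions)  # accumulate over every batch
final_auc = auc.result()                   # valid only on the same instance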

We build the function that computes metrics inside a factory function like create_metrics_fn. That function uses metrics such as tf.keras.metrics.AUC to compute metric values. ElasticDL then keeps calling it from evaluation_process whenever a worker receives evaluation tasks. Two potential problems can happen in the evaluation_process function:

  • If we decorate evaluation_process with tf.function, it raises an error, because we cannot create variables inside the same function across multiple calls. See the TensorFlow issue.
  • If we do not decorate evaluation_process with tf.function, each call creates new variables, so the metric state never accumulates and the final result is incorrect (see the sketch below).
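A sketch of the second problem; create_metrics_fn and the evaluation loop below are simplified stand-ins, not ElasticDL's actual code:

def create_metrics_fn():
    auc = tf.keras.metrics.AUC()  # a fresh instance on every call
    def metrics_fn(labels, predictions):
        auc.update_state(labels, predictions)
        return {"auc": auc.result()}
    return metrics_fn

for labels, predictions in evaluation_tasks:
    metrics_fn = create_metrics_fn()          # new AUC for every task...
    result = metrics_fn(labels, predictions)  # ...so it covers only this task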

Take product recommendation as another example. Users initialize a weight vector over product types and want to compute a loss weighted by this vector. That means we have to share the weight vector between the loss and create_model functions. Without a new Spec class, we again have to use global variables to share it.
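A sketch of how the Spec removes that global; build_recommendation_model and the weighting scheme are hypothetical:

import tensorflow as tf

class RecommendationSpec(TrainSpec):
    def __init__(self, context=None):
        super(RecommendationSpec, self).__init__(context=context)
        # One weight per product type, shared by the model and the loss
        # without any module-level global.
        self._type_weights = tf.constant([1.0, 2.0, 0.5])
        self._model = build_recommendation_model(self._type_weights)

    @property
    def model(self):
        return self._model

    @property
    def loss_fn(self):
        weights = self._type_weights

        def weighted_loss(outputs, labels, product_types):
            # Hypothetical: each example is weighted by its product type.
            example_weights = tf.gather(weights, product_types)
            return tf.losses.mean_squared_error(
                labels, outputs, weights=example_weights)

        return weighted_loss
    # optimizer, dataset_fn, and metrics_fn are omitted for brevity.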

Notice: the current model evaluation process is incorrect for metrics that cannot be averaged. For the AUC metric, each worker calculates AUC based on the tasks it receives and reports it. The master collects all those values and averages them, which is not correct. If we still want to use tf.keras.metrics.AUC to calculate the AUC metric, we have to update evaluation:

  • Workers just report predictions and labels.
  • The master can execute one of the following actions:
    • It collects all predictions and labels and calls tf.keras.metrics.AUC once over all the data, so no aggregation is needed.
    • It resets the tf.keras.metrics.AUC at the beginning and calls update_state whenever it receives predictions and labels (it does not need all of them before starting the calculation). When we finish iterating over the evaluation data, we get the final result through result(); see the sketch after this list.
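A sketch of the second option on the master side; reported_results is a stand-in for receiving (labels, predictions) pairs from workers:

import tensorflow as tf

auc = tf.keras.metrics.AUC()
auc.reset_states()  # reset at the start of each evaluation round
for labels, predictions in reported_results():
    auc.update_state(labels, predictions)  # stream in workers' reports
final_auc = auc.result()  # computed once, after all workers report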

If we agree on the previous update, then for the case of tf.keras.metrics.AUC we do not need a Spec. But there are other situations where introducing a Spec class makes it easy to share variables without resorting to global variables.

chunyang-wen avatar Aug 30 '19 05:08 chunyang-wen