Move training and evaluation related functions from module functions to member functions
Currently, in order to train models, the ElasticDL training worker requires the following command line parameters:

- `model_def`:
  - Name of a class which inherits from `tf.keras.Model`, such as `mnist_module.MnistModel`.
  - Name of a function which returns a `tf.keras.Model` instance, such as `mnist_module.create_model`.
- `optimizer`: A function that returns a TensorFlow optimizer, such as `tf.train.GradientDescentOptimizer`.
- `loss`: A function that returns a tensor which represents the loss.
- `eval_metrics_fn`: A function that returns a dict. The key of the dict is the metric's name and the value is the metric's corresponding tensor.
- `dataset_fn`: A function that returns a `tf.data.Dataset`. It is the input of training.
Those parameters are names of functions in the module where we define `model_def`. ElasticDL uses those functions in different stages of training or evaluation. Defining them as module-level functions has the following limitations:
- It is hard to share information between functions unless we use global variables.
- It is hard to find out that one of those functions is not defined until ElasticDL calls it, and by then it may be too late because the training or evaluation process may already have been running for a long time.
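For illustration, here is a minimal sketch of the current module-function style (the module contents and names such as `_FEATURE_DIM` are hypothetical); the only way for `dataset_fn` and the model function to agree on shared configuration is a module-level global:

```python
import tensorflow as tf

# Module-level global: the only way for the functions below to share state.
_FEATURE_DIM = 64


def custom_model():
    inputs = tf.keras.Input(shape=(_FEATURE_DIM,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


def dataset_fn(files):
    dataset = tf.data.TextLineDataset(files)
    # ... parse each record into a `_FEATURE_DIM`-wide feature vector ...
    return dataset
```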
As a result, we prefer a unified class to encapsulate them. Let's call it `TrainSpec`.
```python
from abc import ABC, abstractmethod


class TrainSpec(ABC):
    def __init__(self, context=None):
        """
        Args:
            context: a dict of contextual information from ElasticDL,
                such as worker_id, master ip.
        """
        self._context = context

    @abstractmethod
    def create_optimizer(self, lr=0.1):
        """Returns a TensorFlow optimizer such as `tf.train.GradientDescentOptimizer`."""

    @abstractmethod
    def create_loss(self, outputs, labels):
        """Returns a tensor that represents the loss computed from `outputs` and `labels`."""

    @abstractmethod
    def create_metrics(self,
                       mode=Mode.TRAINING,  # `Mode` is provided by ElasticDL
                       outputs=None,
                       labels=None,
                       predictions=None):
        """Returns a dict of metrics."""

    @abstractmethod
    def create_model(self):
        """Returns the created model.

        Users can use the functional API or subclassing to create a customized Keras model.
        """
```
```python
class MnistTrainSpec(TrainSpec):
    def create_optimizer(self, lr=0.1):
        return tf.train.GradientDescentOptimizer(lr)
    # other functions are left undefined purposely.
```
Users define `MnistTrainSpec`, which inherits from `TrainSpec`. We get at least two benefits:

- We can share information between functions since they are all member functions of the same class.
- We can use abstract methods to help users find out which required functions are not implemented. ElasticDL will report an error (Python raises `TypeError` for unimplemented abstract methods) the first time we try to instantiate the above `MnistTrainSpec`.
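For example, with the abstract base class above, the mistake surfaces at construction time rather than deep inside training (a small sketch):

```python
spec = MnistTrainSpec()
# TypeError: Can't instantiate abstract class MnistTrainSpec
# with abstract methods create_loss, create_metrics, create_model
```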
Another design of `TrainSpec`

The above `TrainSpec` assumes that the different `create_*` functions are isolated. If we need to share information between those functions, we have to put it in the constructor. Another drawback is that multiple calls to the same `create_*` function may return different instances. For example, `create_model` will return two instances of `tf.keras.Model` if we call it twice.
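For instance (a sketch, assuming a fully implemented `MnistTrainSpec`):

```python
spec = MnistTrainSpec()
model_a = spec.create_model()
model_b = spec.create_model()
assert model_a is not model_b  # two different tf.keras.Model instances
```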
Why do we run into this dilemma? Because we have done too much work for users. Users have a better understanding of their own model, loss, and optimizer. They know when to create an optimizer and what information to share between different functions, such as between the loss and the model.
What ElasticDL cares about are only the results:
- the model instance
- the optimizer instance
- the loss function
- the dataset function
- the metric function
So we redefine `TrainSpec` as an abstract base class with only abstract properties. If we want to train an MNIST model, we can provide a class `MnistSpec` that implements all the abstract properties. Users can use the functional API or subclassing to implement those properties and return the results.
```python
import abc


class TrainSpec(abc.ABC):
    def __init__(self, context=None):
        """
        Args:
            context: A dict of contextual information from ElasticDL,
                such as worker_id, master ip.
        """
        self._context = context or {}

    @property
    @abc.abstractmethod
    def optimizer(self):
        """Returns an instance of a TensorFlow optimizer, such as
        `tf.train.GradientDescentOptimizer`.
        """

    @property
    @abc.abstractmethod
    def loss_fn(self):
        """Returns a loss function with parameters `outputs` and `labels`, e.g.:

            def loss_fn(outputs, labels):
                return tf.losses.mean_squared_error(labels, outputs)

        You can return a function like `loss_fn`.
        """

    @property
    @abc.abstractmethod
    def model(self):
        """Returns an instance of `tf.keras.Model` or of one of its subclasses."""

    @property
    @abc.abstractmethod
    def dataset_fn(self):
        """Returns a function that generates a `tf.data.Dataset` instance, e.g.:

            def dataset_fn(files):
                # build the dataset from the input files
                return dataset

        You can return a dataset function like `dataset_fn`.
        """

    @property
    @abc.abstractmethod
    def metrics_fn(self):
        """Returns a metric function, e.g.:

            def metrics_fn(mode=Mode.TRAINING,
                           outputs=None,
                           labels=None,
                           predictions=None):
                if mode == Mode.EVALUATION:
                    return {"accuracy": tf.metrics.accuracy(labels, predictions)}
                else:
                    return {}

        You can return a function like `metrics_fn`.
        """
```
```python
class MnistModel(tf.keras.Model):
    def call(self, inputs, training=False):
        pass


def _loss_fn(outputs, labels):
    return tf.losses.mean_squared_error(labels, outputs)


class MnistSpec(TrainSpec):
    def __init__(self, context=None):
        super(MnistSpec, self).__init__(context=context)
        self._optimizer = tf.train.GradientDescentOptimizer(0.1)
        self._model = MnistModel()

    @property
    def optimizer(self):
        return self._optimizer

    @property
    def model(self):
        return self._model

    @property
    def loss_fn(self):
        return _loss_fn

    # other properties are left undefined purposely.
```
Users can create the model/optimizer/loss_fn/dataset_fn anywhere. The code above creates the objects returned by the properties directly in `MnistSpec`'s constructor.
This new design gives us the same instance even if we access a property multiple times.
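A quick check of this behavior (a small sketch, assuming the `MnistSpec` above with all properties implemented):

```python
spec = MnistSpec()
assert spec.model is spec.model          # the same tf.keras.Model instance
assert spec.optimizer is spec.optimizer  # the same optimizer instance
```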
I agree that those arguments are not ideal or scalable. I remember they were suggested by the high-level API design here so that users can configure different models, inputs, optimizers, etc. cc @wangkuiyi @ywskycn @skydoorkai
I am trying to understand https://github.com/wangkuiyi/elasticdl/issues/1006#issue-478372383. The following is my understanding.
Currently, the function `elasticdl.train` requires the following parameters:

- `model_def`: the model definition, as a `tf.keras.Model`-derived class or a function that returns the inputs and outputs layers,
- `optimizer`: a function that returns a TensorFlow optimizer,
- `loss`: a function that returns the loss value given a minibatch of outputs and ground truth,
- `eval_metrics`: a function that (I don't know eval_metrics_fn and have no idea how to describe it), and
- `dataset`: a function that creates a `tf.data.Dataset` instance.
The above bullets form the settings of a training job. This inspires us to give them a single name, so that we can pass them as a single parameter to `elasticdl.train`.
We can define them as classes and/or functions in a module and refer to the bullets by the module name. However, in some cases, we might need to define variables for information sharing or exchange between these bullets. If we use a module, we'd have to define these variables as global variables in the module, and global variables are often problematic.
An alternative is to define the bullets as nested-classes and/or methods of a class, so we can define the variables as class members. We prefer this way.
To better describe the idea, let us work through an example. Suppose that we want to train a model defined as a `tf.keras.Model`-derived class `EasyAndHappy` using the MNIST dataset; we can define the following class:
```python
class EasyAndHappyTrainerWithMNIST(elasticdl.Training):
    class EasyAndHappy(tf.keras.Model):
        def __init__(self):
            super().__init__()
            create_some_parameters_here()

        def call(self, inputs):
            do_something_here()

    def dataset(self):
        train, test = tf.keras.datasets.mnist.load_data()
        return train

    def cost(self, outputs, labels):
        return_some_cost()
```
@wangkuiyi Just a nitpick here: `dataset()` should probably be separated from the model.
Is it necessary to create a separate `Spec` for ElasticDL?
The short answer is yes.
First, let's review the example in ElasticDL's model zoo that we mentioned during the last meeting: `model_zoo/deepfm_functional_api/deepfm_functional_api.py`.
In that module, we need to create a global variable named `AUC_metric` in order to share information between our model and our metric function. The metric function uses `AUC_metric` to calculate AUC. Global variables often indicate a bad design.
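The pattern in that module looks roughly like the following; this is a simplified sketch, not the exact code from the model zoo:

```python
import tensorflow as tf

AUC_metric = None  # global variable shared between the model code and the metric function


def custom_model():
    global AUC_metric
    AUC_metric = tf.keras.metrics.AUC()
    # ... build and return the DeepFM model with the functional API ...


def eval_metrics_fn(predictions, labels):
    # Reaches for the global metric instance created in custom_model.
    AUC_metric.update_state(labels, predictions)
    return {"auc": AUC_metric.result()}
```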
`AUC_metric` is an instance of `tf.keras.metrics.AUC`. It has three member functions that manipulate its data:

- `update_state`: Update its internal state (the variables that this metric creates).
- `reset_states`: Reset its internal state.
- `result`: Get the final result of this metric.
When we create an instance of `tf.keras.metrics.AUC`, we continuously update its state and call `result()` to get the final result. So we have to reuse the same instance during a single evaluation (a complete pass over the evaluation dataset).
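Concretely, the streaming usage within one evaluation pass looks like this (a small sketch; `evaluation_batches` is a hypothetical iterable of label/prediction pairs):

```python
auc = tf.keras.metrics.AUC()
auc.reset_states()  # start a fresh evaluation

for labels, predictions in evaluation_batches:
    auc.update_state(labels, predictions)  # accumulate into the same instance

final_auc = auc.result()  # AUC over the whole evaluation dataset
```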
We create the metric-computing function in a function like `create_metrics_fn`. That function uses metrics such as `tf.keras.metrics.AUC` to calculate the metric values. ElasticDL then keeps calling the function in `evaluation_process` when a worker receives evaluation tasks. Two potential problems can happen in the `evaluation_process` function (see the sketch after this list):
- If we decorate `evaluation_process` with `tf.function`, it raises an error because we cannot create variables in the same function across multiple calls (see the related TensorFlow issue).
- If we do not decorate `evaluation_process` with `tf.function`, each call to the function creates new variables, so the data does not accumulate and the final result is not correct.
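A sketch of both failure modes, using a hypothetical `evaluation_process` that creates the metric inside the function:

```python
import tensorflow as tf


# (1) Decorated with tf.function: creating the metric (and its variables)
# inside the traced function triggers the "tried to create variables on
# non-first call" error on repeated calls.
@tf.function
def evaluation_process(labels, predictions):
    auc = tf.keras.metrics.AUC()
    auc.update_state(labels, predictions)
    return auc.result()


# (2) Not decorated: every call builds a brand-new AUC instance, so the
# state never accumulates across evaluation tasks and the result is wrong.
def evaluation_process_eager(labels, predictions):
    auc = tf.keras.metrics.AUC()
    auc.update_state(labels, predictions)
    return auc.result()
```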
Take product recommendation as another example. Users initialize a `weight` vector for different product types and want to calculate a weighted loss according to this weight vector. That means we have to share `weight` between the `loss` and `create_model` functions. If we do not introduce a new `Spec` class, we again have to use global variables to share this weight vector.
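A sketch of how the property-based `TrainSpec` could hold such a shared weight vector as a plain member instead of a global (names such as `RecommendationSpec` and `_product_weights` are hypothetical):

```python
import tensorflow as tf


class RecommendationSpec(TrainSpec):
    def __init__(self, context=None):
        super(RecommendationSpec, self).__init__(context=context)
        # Per-product-type weight vector, a plain member instead of a global.
        self._product_weights = tf.constant([1.0, 2.0, 0.5])
        self._model = self._build_model()

    def _build_model(self):
        # The model builder consults the same member, e.g. to size its output layer.
        num_product_types = int(self._product_weights.shape[0])
        inputs = tf.keras.Input(shape=(8,))
        outputs = tf.keras.layers.Dense(num_product_types)(inputs)
        return tf.keras.Model(inputs=inputs, outputs=outputs)

    @property
    def model(self):
        return self._model

    @property
    def loss_fn(self):
        weights = self._product_weights

        def loss_fn(outputs, labels):
            # Weighted squared error using the shared weight vector.
            return tf.reduce_mean(weights * tf.square(outputs - labels))

        return loss_fn

    # other required properties omitted for brevity.
```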
Notice

The current model evaluation process is not correct for metrics that we cannot average. For AUC, each worker calculates AUC based on the tasks it receives and reports it. The master collects all those metrics and averages them, which is not correct. If we still use `tf.keras.metrics.AUC` to calculate the AUC metric, we have to update evaluation:

- Workers just report predictions and labels.
- The master can execute one of the following actions:
  - It collects all predictions and labels and calls `tf.keras.metrics.AUC` once for all the data. We do not have to aggregate metrics from workers.
  - It resets the `tf.keras.metrics.AUC` at the beginning, and calls `update_state` when it receives predictions and labels (it does not need all the predictions and labels to start the calculation). When we finish iterating over the evaluation data, we get the final result through `result()`.
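The second option on the master side could look roughly like this (a sketch; `reports_from_workers` is a hypothetical stream of predictions and labels reported by workers):

```python
auc = tf.keras.metrics.AUC()
auc.reset_states()  # reset at the beginning of the evaluation job

for labels, predictions in reports_from_workers():
    auc.update_state(labels, predictions)  # no need to buffer all the data

final_auc = auc.result()  # final AUC after all evaluation tasks are processed
```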
If we agree on the previous update, then for the case of `tf.keras.metrics.AUC` we do not need `Spec`. But there are other situations where it is easier to share variables by introducing a `Spec` class than by using global variables.