
TODO List for 0.1 release

Open pavanky opened this issue 9 years ago • 53 comments

  • Base Classes
  • Autograd
  • Neural Network
  • Solvers / Optimizers
  • Examples

pavanky avatar Jun 17 '15 22:06 pavanky

@pavanky : It doesn't necessarily have to be a classifier though. Neural networks can easily handle regression problems as well. The only difference would be the loss function used, i.e. L2 loss (regression) vs. cross-entropy + softmax (classification).
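
For concreteness, a minimal sketch of the two losses written directly against af::array (the helper names are illustrative and not part of arrayfire-ml):

#include <arrayfire.h>

// L2 loss for regression: 0.5 * sum((pred - target)^2)
static float l2Loss(const af::array& pred, const af::array& target) {
    af::array diff = pred - target;
    return 0.5f * af::sum<float>(diff * diff);
}

// Softmax + cross-entropy for classification, with classes along dim 0
// and one-hot targets of the same shape as the logits.
static float softmaxCrossEntropy(const af::array& logits, const af::array& oneHot) {
    const unsigned nClasses = static_cast<unsigned>(logits.dims(0));
    af::array shifted = logits - af::tile(af::max(logits, 0), nClasses); // subtract max for stability
    af::array ex      = af::exp(shifted);
    af::array probs   = ex / af::tile(af::sum(ex, 0), nClasses);
    return -af::sum<float>(oneHot * af::log(probs));
}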

jramapuram avatar Jun 18 '15 06:06 jramapuram

I think the following would be helpful from an API standpoint:

struct Model {
  int add(Layer layer_type);
  int compile(Optimizer opt, Loss loss, int max_iter = 200, bool early_stop = true);
  // Batch training up to max_iter, or earlier if early_stop triggers. Accepts either
  // explicit cross-validation data or a ratio for splitting the given data.
  float fit(DataSet train_data, DataSet target_data,
            const std::tuple<DataSet, DataSet>* validation = nullptr,
            float validation_split = 0.0f);
  float train(DataSet train_data, DataSet target_data); // single step for online methods (can be called from fit)
  DataSet predict(DataSet test_data); // for evaluating new data
};

All the ints in the above are return codes.

This will give maximum flexibility in:

  1. Layer creation
  2. Model training
  3. Online + batch setting
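
For illustration, a hypothetical usage of the proposed interface might look like this (Model, Layer, DataSet, Optimizer and Loss are the placeholder types from the sketch above; how the layers and the optimizer are configured is left open, and none of this exists in arrayfire-ml yet):

// Purely illustrative; assembles and trains a small network through the proposed Model API.
Layer hidden, readout;   // layer configuration is left open in the proposal
Optimizer opt;           // e.g. SGD
Loss loss;               // e.g. cross-entropy + softmax
DataSet train_data, target_data, test_data;  // placeholder data containers

Model model;
model.add(hidden);
model.add(readout);
model.compile(opt, loss, /*max_iter=*/100, /*early_stop=*/true);

// Either pass explicit validation data, or let fit() carve out 20% of the training set.
float val_err = model.fit(train_data, target_data, /*validation=*/nullptr, /*validation_split=*/0.2f);
DataSet preds = model.predict(test_data);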

The Layer class as you mentioned should do the following:

struct Layer {
  int connect(Layer prev); // connect to the previous layer in a deep net
  DataSet derivative(DataSet x);
  DataSet forwardPass(DataSet x);
  DataSet input();   // merely returns the input that was last used (or the output of the previous layer)
  Weights weights(); // returns just the weights
  Bias bias();       // returns the bias (or stack of biases) if any, otherwise perhaps an empty/null value
  std::map<std::string, std::string> conf(); // getter returning the configuration of the layer itself
};

jramapuram avatar Jun 18 '15 09:06 jramapuram

According to much recent research, the most powerful neural networks are no longer stacks of layers but arbitrarily complex graphs, e.g. the many advanced recursive neural networks, Facebook AI Research's Memory Networks, Google DeepMind's Neural Turing Machine, etc. So node is a more general name than layer.

connect can't reflect the direction of connections between the nodes. Except for the nodes that connect to the input data, each node's inputs are the outputs of its predecessor nodes. Caffe uses Google Protocol Buffers to define and serialize the network; Apache Thrift, which was open-sourced by Facebook, supports many more languages.

The following API is inspired by Caffe's Blob, Layer and Net.

typedef shared_ptr<array> ArrayPtr;

class Data {
  public:
    explicit Data(const vector<int>& size);
    int nDimension() const;
    vector<int> size() const;
    // Caffe exposes the raw CPU/GPU pointers for use in BLAS functions.
    // array already has a high-level API, so there's no need to do so here.
    ArrayPtr weights() const;
    ArrayPtr gradients() const;
};

typedef shared_ptr<Data> DataPtr;
typedef vector<DataPtr> DataPtrVec;

class Node {
 public:
  explicit Node(NodeParam& nodeParam);
  virtual ~Node();
  // Calls initNode which subclass can override
  void init();
  // Input and output are more general than the top and bottom of Caffe
  virtual void forward(const DataPtrVec& input, 
      const DataPtrVec& output);
  // propagate_back is more general than propagate_down of Caffe
  virtual void backward(const DataPtrVec& input,
      const vector<bool>& propagate_back,
      const DataPtrVec& output);
  // The model is a DAG (Directed Acyclic Graph);
  // it's more intuitive for the predecessor to add the successor
  void addSuccessor(Node& node);
  void addSuccessors(vector<Node>& nodes);  
 protected:
  virtual void initNode();
};

// Dtype is float or double
template <typename Dtype>
class Graph {
 public:
  explicit Graph(GraphParam& graphParam);
  virtual ~Graph();
  virtual void forward(const DataPtrVec& inputs, DataPtrVec* outputs,
      Dtype* loss = NULL);
  /**
   * (Caffe) The network backward should take no input and output, since it solely
   * computes the gradient w.r.t the parameters, and the data has already been
   * provided during the forward pass.
   */
  virtual void backward();
  Dtype forwardBackward(const DataPtrVec& inputs) {
    Dtype loss;
    DataPtrVec outputs;
    forward(inputs, &outputs, &loss);
    backward();
    return loss;
  }
};
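
For illustration, a concrete node under this proposal might look roughly like the following, assuming the Data/Node classes and typedefs above with af::array payloads (FullyConnectedNode and its weights_/bias_ members are illustrative names, not a committed design):

class FullyConnectedNode : public Node {
 public:
  explicit FullyConnectedNode(NodeParam& nodeParam) : Node(nodeParam) {}

  // y = W x + b, with the parameters held privately by the node.
  void forward(const DataPtrVec& input, const DataPtrVec& output) override {
    const array& x = *input[0]->weights();
    *output[0]->weights() =
        af::matmul(*weights_->weights(), x) +
        af::tile(*bias_->weights(), 1, static_cast<unsigned>(x.dims(1)));
  }

  void backward(const DataPtrVec& input,
                const vector<bool>& propagate_back,
                const DataPtrVec& output) override {
    const array& x  = *input[0]->weights();
    const array& dy = *output[0]->gradients();
    *weights_->gradients() = af::matmul(dy, x, AF_MAT_NONE, AF_MAT_TRANS); // dL/dW
    *bias_->gradients()    = af::sum(dy, 1);                               // dL/db
    if (propagate_back[0])  // dL/dx, only if the predecessor needs it
      *input[0]->gradients() = af::matmul(*weights_->weights(), dy, AF_MAT_TRANS, AF_MAT_NONE);
  }

 private:
  DataPtr weights_, bias_;  // parameters owned by the node
};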

futurely avatar Jun 22 '15 20:06 futurely

Microsoft Research's "Computational Networks: A Generalization of Deep Learning Models" presented its open-source deep learning framework CNTK. [The original comment attached several figures from the paper here.]

futurely avatar Jun 23 '15 00:06 futurely

I like the generalizations. Few notes:

  void addSuccessor(Node& node);

seems redundant with:

  void addSuccessors(vector<Node>& nodes);

since a node can have subnodes as well, right? Why not just have a ptr to the next and previous nodes: a basic doubly-linked list.

One item that is very important and that your schema is missing is some form of accuracy. 99% of the time you will need some form of data splitting and verification against a cross-validation data set. It would be a good idea to incorporate this right off the bat.
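
To make that concrete, a minimal sketch of such helpers written directly against af::array (the function names are illustrative, not arrayfire-ml API):

#include <arrayfire.h>

// Percentage of predicted labels that match the ground truth.
static float accuracy(const af::array& predicted, const af::array& truth) {
    return 100.0f * af::count<float>(predicted == truth)
                  / static_cast<float>(truth.elements());
}

// Hold-out split of samples stored as rows.
// Assumes 0 < ratio < 1 so that both parts are non-empty.
static void holdoutSplit(const af::array& X, float ratio,
                         af::array& train, af::array& valid) {
    const dim_t n      = X.dims(0);
    const dim_t nValid = static_cast<dim_t>(n * ratio);
    train = X(af::seq(0, static_cast<double>(n - nValid - 1)), af::span);
    valid = X(af::seq(static_cast<double>(n - nValid), static_cast<double>(n - 1)), af::span);
}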

jramapuram avatar Jun 23 '15 06:06 jramapuram

@jramapuram

If I am reading correctly, the following is for cases when a single node is connected to multiple successors (like the CN with shared params diagram).

void addSuccessors(vector<Node>& nodes);

I am not sure a linked list is the solution here.

pavanky avatar Jun 23 '15 07:06 pavanky

@futurely Thanks for the great feedback! The proposed API looks solid. ~~However the one issue I see is that going the Node / Graph route means that the amount of parallelism will be decreased in networks that use more traditional layers. I wonder if we can specialize a Layer class as well that sits on top of Node to achieve this.~~

pavanky avatar Jun 23 '15 07:06 pavanky

@futurely Sorry for jumping the gun. I re-read the entire discussion again. It looks like the proposed Node is just a generalized Layer.

pavanky avatar Jun 23 '15 07:06 pavanky

@futurely @jramapuram Would it be possible to continue the discussion over here: https://gitter.im/arrayfire/arrayfire_ml ?

pavanky avatar Jun 23 '15 11:06 pavanky

Suggestions from @alcinos:

  • Instead of having a list of child nodes in each Node, it may be better to have an adjacency list in the Network class.
  • The weights and gradients (and I guess you also meant the biases) shouldn't be part of the input parameters of the "forward" function; they should be private to the node (of course, some getters can be written).

pavanky avatar Jun 25 '15 16:06 pavanky

Here is a proposition including the modifications: https://gist.github.com/alcinos/3bedb2f7c4518fa93220

alcinos avatar Jun 25 '15 18:06 alcinos

The Graph or Network class is not needed at all. Here's a simple illustration:

structure node
   [list of nodes] neighbors
   [data]
end

cost(X, Y) := if (X.neighbors contains Y) return X.neighbors[Y];
           else "Not possible"

list nodes;
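
Roughly the same idea spelled out in C++ (a sketch; GraphNode is an illustrative name):

#include <memory>
#include <vector>

// Each node owns its payload and a list of neighbours; no separate Graph object is involved.
struct GraphNode {
    std::vector<std::shared_ptr<GraphNode>> neighbors;
    // payload (weights, activations, ...) would live here
};

// The "network" is then nothing more than a container of such nodes.
std::vector<std::shared_ptr<GraphNode>> nodes;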

futurely avatar Jun 26 '15 02:06 futurely

@futurely The point is that the net structure should be independent of the nodes. The same node can be involved in different topologies depending on the context. For example, in stacked autoencoders, the training is done layer by layer, which requires a different topology to train each level.

alcinos avatar Jun 26 '15 04:06 alcinos

#22.

futurely avatar Jun 28 '15 08:06 futurely

@alcinos Can you explain how having an adjacency list helps in the situations you mentioned? I still think a centralized location for the representation is the better option; however, I do not see it solving the problems of greedy layer-by-layer training.

pavanky avatar Jun 30 '15 11:06 pavanky

@pavanky Well, let's say we have a 3-layer stacked autoencoder. I will denote an output layer by "O", an input layer by "I", an encoding layer by "E" and a decoding layer by "D". The first step is to greedily train the first layer. This training is performed on the network I -> E1 -> D1 -> O. After some number of iterations (or once the reconstruction error goes below a given threshold), we can train the second layer. This time the network is I -> E1 -> E2 -> D2 -> O (we want the second layer to reconstruct the encoding of the first one), and so on for the other layers. The last step is a fine-tuning of the weights (by gradient descent), performed on the net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O.

Eventually, depending on the application, it is likely that the interesting part of the trained net is the output of E3 (high-level features of the input). Once trained, we'll thus only use the first part of the net: I -> E1 -> E2 -> E3.

The point is that in all those training steps the architecture of the net is different, hence it makes more sense to store this architecture independently of the nodes. Moreover, several architectures can be used more or less concurrently with the same nodes: for example, we can use the net I -> E1 -> E2 -> E3 as a feature generator for some problem (control, classification, ...), while constantly improving the reconstruction error of the full net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O given new samples that come from experience (the training set of the net is not always fully available from the beginning).
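
As a minimal sketch of what storing the architecture outside the nodes could look like in practice (plain std containers; the node ids follow the notation above and are illustrative only):

#include <map>
#include <string>
#include <vector>

// One adjacency list per training phase, all referring to the same node ids.
using Adjacency = std::map<std::string, std::vector<std::string>>;

int main() {
    // Greedy pre-training of the first pair: I -> E1 -> D1 -> O
    Adjacency phase1   = {{"I", {"E1"}}, {"E1", {"D1"}}, {"D1", {"O"}}};
    // Second pair trained on the first encoding: I -> E1 -> E2 -> D2 -> O
    Adjacency phase2   = {{"I", {"E1"}}, {"E1", {"E2"}}, {"E2", {"D2"}}, {"D2", {"O"}}};
    // Fine-tuning of the fully unrolled net
    Adjacency finetune = {{"I", {"E1"}},  {"E1", {"E2"}}, {"E2", {"E3"}}, {"E3", {"D3"}},
                          {"D3", {"D2"}}, {"D2", {"D1"}}, {"D1", {"O"}}};
    // Feature extraction only traverses the encoder stack: I -> E1 -> E2 -> E3
    Adjacency features = {{"I", {"E1"}}, {"E1", {"E2"}}, {"E2", {"E3"}}};
    return 0;
}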

alcinos avatar Jun 30 '15 16:06 alcinos

I understand what autoencoders are doing. My question was more about the implementation. What you are suggesting requires updating the adjacency list after each step, or creating a new network after each step. Am I correct in this observation?

pavanky avatar Jun 30 '15 16:06 pavanky

Absolutely!

alcinos avatar Jun 30 '15 16:06 alcinos

In that case I think we can have a specialized class for autoencoders (and RBMs) that will do the greedy training and return the encoding part of the network to be included in another network on request.

With that said, I think the other representation of the network (each node having a list of links) can also be made to work in this situation (maybe even a bit more cleanly) if the edges are labeled properly.
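
A possible shape for that specialized class, sketched against the Node/DataPtrVec types proposed earlier in the thread (the AutoEncoder name and its methods are purely illustrative):

class AutoEncoder {
 public:
  virtual ~AutoEncoder() {}
  // Greedy layer-wise pre-training, phase by phase as described above.
  virtual void pretrain(const DataPtrVec& inputs, int iterationsPerLayer) = 0;
  // Hands back only the encoder stack (I -> E1 -> ... -> En) so it can be
  // included in another network on request.
  virtual shared_ptr<Node> encoder() const = 0;
};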

pavanky avatar Jun 30 '15 17:06 pavanky

In the setting where each node has a list of links, how would you deal with the same node belonging to several nets (simultaneously)?

alcinos avatar Jun 30 '15 17:06 alcinos

An inverted multi-leaf tree structure would make sense for this (i.e. becoming inputs to multiple nodes) and would then degenerate to a list. Decoupling a model from the node/layer structure does make sense.

jramapuram avatar Jun 30 '15 17:06 jramapuram

@alcinos Each node can be linked to any number of other nodes. So you begin by adding all the needed links. You then label which networks each link belongs to. When training a particular network, you just traverse the relevant links and remove them when you are done.
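
A minimal sketch of those labelled links (LabeledNode, Link and the use of string labels are illustrative choices, not a committed design):

#include <memory>
#include <set>
#include <string>
#include <vector>

struct LabeledNode;

// An edge to another node, tagged with the networks it belongs to.
struct Link {
    std::shared_ptr<LabeledNode> target;
    std::set<std::string> networks;
};

struct LabeledNode {
    std::vector<Link> links;

    // Traverse only the links labelled with the given network.
    std::vector<std::shared_ptr<LabeledNode>> successors(const std::string& net) const {
        std::vector<std::shared_ptr<LabeledNode>> out;
        for (const auto& l : links)
            if (l.networks.count(net)) out.push_back(l.target);
        return out;
    }

    // Drop a network's label from every link once training it is done.
    void removeNetwork(const std::string& net) {
        for (auto& l : links) l.networks.erase(net);
    }
};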

pavanky avatar Jun 30 '15 18:06 pavanky

@pavanky Hmm, it seems that it can rapidly become a mess, especially because each entity that wants to use a given node in a net has to resolve on its own any label conflicts that may arise (which is not good in terms of encapsulation), and we also have to trust it to properly clean up the links when the net is no longer needed (which is not good either).

alcinos avatar Jun 30 '15 19:06 alcinos

Indeed. Just putting the option out there. I think I have most of the information I need to push something out in the next couple of days.

pavanky avatar Jun 30 '15 20:06 pavanky

Cool, I'm looking forward to taking a look at it, and contributing if I have the opportunity.

alcinos avatar Jun 30 '15 20:06 alcinos

Examples of an RBM implemented with the high-level math API from cudamat are as follows: https://github.com/cudamat/cudamat/blob/master/examples/rbm_cudamat.py https://github.com/cudamat/cudamat/blob/master/examples/rbm_numpy.py

futurely avatar Jul 01 '15 03:07 futurely

@futurely The code you are showing doesn't feature any encapsulation. This is a problem:

  • What if we want to change the number of hidden layers?
  • What if we would like to use convolutional layers instead of fully connected ones?
  • ...

alcinos avatar Jul 01 '15 04:07 alcinos

Caffe creator's plan on "Improving Caffe: Some Refactoring".

futurely avatar Jul 13 '15 14:07 futurely

@futurely @alcinos @jramapuram I pushed some of the basic classes out. I do not have any convolution nodes yet. There are no network classes yet. I am going to push some sample network classes out later today to demonstrate how the structure can be stored.

The network class will also extend Node so that it can be used as a super node in a larger network.

I understand that this is not much right now, but feedback is welcome.

pavanky avatar Jul 14 '15 18:07 pavanky