Deep neural network framework for multiple GPUs

clusterNet

Deep neural network framework for GPU clusters:

  • supports NVIDIA GPUDirect RDMA

  • easy distributed computation:

    Matrix C = dot(A,B); //uses one GPU
    Matrix C = dotMPI(A,B); //uses all available GPUs on the board or in the network

  • no delay between batches due to asynchronous memory copies to the GPU:

    gpu.init_batch_allocator(X, y, 128);
    for(int i = 0; i < gpu.m_total_batches; i++)
    {
        gpu.allocate_next_batch_async(); //loads the next batch while you do computations
        result = gpu.dot(gpu.m_current_batch_X,w1); //do your computations here
        gpu.replace_current_batch_with_next(); //get the next batch which is already loaded
    }

  • distributed weights which are larger than a single GPU memory:

    ClusterNet gpus = ClusterNet(argc,argv,12346);
    Matrix *batch = gpus.rand(128,100000);//48 MB
    Matrix *out1 = empty(128,40000);//19 MB
    Matrix *out2 = empty(128,20000);//9 MB
    Matrix *W1 = gpus.distributed_uniformSqrtWeight(100000,40000);//15258 MB
    Matrix *W2 = gpus.distributed_uniformSqrtWeight(40000,20000);//3051 MB
    gpus.tick("Time taken");
    gpus.dotMPI(batch,W1,out1);
    gpus.dotMPI(out1,W2,out2);
    gpus.tock("Time taken");

Time taken: 117.704285 ms.
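
The weight sizes quoted in the comments above are consistent with 4-byte floats counted in MiB (100000 x 40000 x 4 bytes is about 15258 MiB). The sketch below reproduces that arithmetic and illustrates one way a product with an oversized weight matrix could be split: each worker owns a contiguous block of the weight matrix's columns and computes the matching block of the output. This is an assumption for illustration only; the README does not state how clusterNet partitions its distributed weights. HostMatrix, matrix_mib, dot_columns and dot_sharded are hypothetical names, and the "workers" are simulated one after another on the host rather than on real GPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size in MiB of a rows x cols matrix, assuming 4-byte float elements.
double matrix_mib(std::size_t rows, std::size_t cols)
{
    return static_cast<double>(rows) * static_cast<double>(cols)
           * static_cast<double>(sizeof(float)) / (1024.0 * 1024.0);
}

// Hypothetical row-major host matrix; a stand-in for the library's Matrix type.
struct HostMatrix
{
    std::size_t rows, cols;
    std::vector<float> data;
    HostMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
};

// Accumulate A * B into C, restricted to columns [begin, end) of B and C.
void dot_columns(const HostMatrix &A, const HostMatrix &B, HostMatrix &C,
                 std::size_t begin, std::size_t end)
{
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = 0; k < A.cols; ++k)
            for (std::size_t j = begin; j < end; ++j)
                C.data[i * C.cols + j] += A.data[i * A.cols + k] * B.data[k * B.cols + j];
}

// Reference product computed in one piece: C = A * B.
HostMatrix dot(const HostMatrix &A, const HostMatrix &B)
{
    assert(A.cols == B.rows);
    HostMatrix C(A.rows, B.cols);
    dot_columns(A, B, C, 0, B.cols);
    return C;
}

// Column-sharded product: worker w handles one contiguous column block of B.
// On a real cluster each worker would hold only its own block of B; here the
// blocks are processed sequentially to show that the result is unchanged.
HostMatrix dot_sharded(const HostMatrix &A, const HostMatrix &B, std::size_t workers)
{
    assert(A.cols == B.rows && workers > 0);
    HostMatrix C(A.rows, B.cols);
    for (std::size_t w = 0; w < workers; ++w)
    {
        std::size_t begin = B.cols * w / workers;
        std::size_t end = B.cols * (w + 1) / workers;
        dot_columns(A, B, C, begin, end);
    }
    return C;
}
```

Under this assumed even column split, each worker needs only matrix_mib(rows, cols) / workers of weight memory, so the 15258 MB matrix would take roughly 3815 MiB per worker with a hypothetical four workers, while the small input batch is needed in full by every worker.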