trafficstars

clusterNet

Deep neural network framework for GPU clusters:

supports NVIDIA GPUDirect RDMA
easy distributed computation:

Matrix C = dot(A,B); //uses one GPU
Matrix C = dotMPI(A,B); //uses all available GPUs on the board or in the network
no delay between batches due to asynchronous memory copies to the GPU:
gpu.init_batch_allocator(X, y, 128); for(int i = 0; i < gpu.m_total_batches; i++) { gpu.allocate_next_batch_async(); //loads the next batch while you do computations result = gpu.dot(gpu.m_current_batch_X,w1); //do your computations here gpu.replace_current_batch_with_next(); //get the next batch which is already loaded }
distributed weights which are larger than a single GPU memory:
ClusterNet gpus = ClusterNet(argc,argv,12346); Matrix *batch = gpus.rand(128,100000);//34 MB Matrix *out1 = empty(128,40000);//19 MB Matrix *out2 = empty(128,20000);//9 MB Matrix *W1 = gpus.distributed_uniformSqrtWeight(100000,40000);//15258 MB Matrix *W2 = gpus.distributed_uniformSqrtWeight(40000,20000);//3051 MB gpus.tick("Time taken"); gpus.dotMPI(batch,W1,out1); gpus.dotMPI(out1,W2,out2); gpus.tock("Time taken");

Time taken: 117.704285 ms.

clusterNet
clusterNet copied to clipboard