No training speed improvement can be obtained by using multiple GPUs with MXNet as the backend
Hi, I have some questions about training speed when using multiple GPUs with MXNet as the backend for Keras. According to https://mxnet.incubator.apache.org/how_to/multi_devices.html: "By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model." My understanding is that with a fixed batch size b, each GPU computes gradients on only b/k examples, which should consume less time than computing gradients on all b examples with a single GPU. As a result, with the same batch size, each weight-update iteration should be faster with multiple GPUs than with a single GPU. But in my experiments I found that training with multiple GPUs is actually slower than training with a single GPU.
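For intuition, here is a back-of-envelope sketch (my own rough estimate, using the layer sizes from the code below) of how much gradient traffic each iteration generates versus how little compute each GPU is left with:

# Rough estimate of gradient traffic per iteration for the model in this thread.
# Layer sizes are taken from the posted code; with b=128 and k=4, each GPU gets 32 samples.
layers = [(2056, 2800), (2800, 2800), (2800, 2800), (2800, 607)]
params = sum(fan_in * fan_out + fan_out for fan_in, fan_out in layers)
print(params)                 # ~23.1 million parameters
print(params * 4. / 1024**2)  # ~88 MB of float32 gradients summed across GPUs per iteration
print(128 // 4)               # only 32 samples of forward/backward work per GPU

With only 32 samples of compute per GPU to hide roughly 88 MB of gradient synchronization, each iteration can end up slower than on a single GPU even though the per-GPU compute is smaller.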
Below is part of my code, where I used a fully-connected network:
model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(257))
model.summary()
opt = SGD()
NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
batch_size = 128
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)
I don't know whether my understanding is right. Why is there no speed improvement with multiple GPUs? Can anyone help with my questions? Thanks!
Below is the training output with 1 GPU and 4 GPUs, respectively:
1 GPU: (training log screenshot)
4 GPUs: (training log screenshot)
It seems that training with 4 GPUs converges faster, but each epoch takes more time.
Can you provide the full code for your experiment? Sometimes multi-GPU training won't give any boost, and can even slow training down, because of hardware communication overhead.
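One commonly suggested adjustment (a general data-parallel rule of thumb, not something tested in this thread) is to scale the global batch size with the number of GPUs so that each device keeps its original workload per iteration:

# Hypothetical change: keep 128 samples per GPU instead of 128 in total.
NUM_GPU = 4
PER_GPU_BATCH = 128
batch_size = PER_GPU_BATCH * NUM_GPU  # 512 in total; MXNet splits this back to 128 per GPU

The gradient exchange then happens a quarter as often per epoch, though the learning rate may need retuning for the larger batch.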
OK, my code is shown below:
import numpy as np
np.random.seed(1337)  # for reproducibility
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD
from sklearn import preprocessing
import random
from keras import backend as K
def my_loss(y_true, y_pred):
    # weighted sum of squared errors over three slices of the 607-dim output
    term1 = K.sum(K.square(y_pred[:, :257] - y_true[:, :257]), axis=-1)
    term2 = K.sum(K.square(y_pred[:, 257:350] - y_true[:, 257:350]), axis=-1)
    term3 = K.sum(K.square(y_pred[:, 350:] - y_true[:, 350:]), axis=-1)
    return 0.5*term1 + 0.3*term2 + 0.2*term3
data_dir = '/work/Wendison/training_data/'
NameX = []
NameY = []
Numxy = []
## as the training data is too big (>100 GB), I divided it into 20 file pairs (input + label)
for j in range(1, 21):
    NameX.append(data_dir + 'Xtrain' + str(j) + '.npy')  # paths for the DNN input data
    NameY.append(data_dir + 'Ytrain' + str(j) + '.npy')  # paths for the DNN label data
    Numxy.append(data_dir + 'Num' + str(j) + '.npy')     # paths for the per-file sample counts
meanx = np.load('meanx.npy')
stdx = np.load('stdx.npy')
meany = np.load('meany.npy')
stdy = np.load('stdy.npy')
scalerx = preprocessing.StandardScaler()
scalery = preprocessing.StandardScaler()
scalerx.mean_ = meanx
scalerx.scale_ = stdx
scalery.mean_ = meany
scalery.scale_ = stdy
## use the last file pair as the validation data
tempx = np.load(NameX[-1])
tempy = np.load(NameY[-1])
X_val = scalerx.transform(tempx)
Y_val = scalery.transform(tempy)
NameX.pop()
NameY.pop()
Numxy.pop()
batch_size = 128
Num = len(Numxy)
numall = 0
for i in range(len(Numxy)):
    nn = np.load(Numxy[i])
    numall += sum(nn)  # compute the total number of training samples
## define a data generator to read the training data
def mygenerator(batch_size=batch_size):
    while 1:  # a Keras generator must yield forever; reshuffle the file order on each pass
        order = list(range(Num))
        random.shuffle(order)  # shuffle the order of training files
        for i in order:
            tempx = np.load(NameX[i])
            tempy = np.load(NameY[i])
            X_train = scalerx.transform(tempx)
            Y_train = scalery.transform(tempy)
            orde = list(range(X_train.shape[0]))
            random.shuffle(orde)  # shuffle the order of samples within each data file
            X_train = X_train[orde, :]
            Y_train = Y_train[orde, :]
            numb = X_train.shape[0] // batch_size  # number of batches in this file
            for ii in range(numb):
                if ii < numb - 1:
                    yield X_train[ii*batch_size:(ii+1)*batch_size, :], Y_train[ii*batch_size:(ii+1)*batch_size, :]
                else:
                    yield X_train[ii*batch_size:, :], Y_train[ii*batch_size:, :]
## model definition
model = Sequential()
model.add(Dropout(0.1, input_shape=(2056,)))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(607))
model.summary()
opt = SGD()
NUM_GPU = 4
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)
model.compile(loss=my_loss, optimizer=opt, context=gpu_list)
mygen = mygenerator()
for i in range(1, 101):
    model.fit_generator(mygen, samples_per_epoch=numall, nb_epoch=1, verbose=1,
                        validation_data=(X_val, Y_val))
The training data is very large (>100 GB), so I divided it into 20 file pairs and load them one by one within each epoch using a Keras generator. Could that be related to the multi-GPU training speed? Thanks! @kevinthesun
@Wendison You can benchmark pure training time without data I/O to see whether data I/O is the bottleneck.
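A minimal sketch of such a benchmark (my own, not from this thread): it trains on synthetic in-memory data so the generator and disk are out of the loop, reuses the layer sizes from the posted model, and substitutes loss='mse' for my_loss to stay self-contained:

import time
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout

# synthetic in-memory data: removes the generator and disk I/O entirely
X = np.random.rand(50000, 2056).astype('float32')
Y = np.random.rand(50000, 607).astype('float32')

def build(context):
    model = Sequential()
    model.add(Dropout(0.1, input_shape=(2056,)))
    for _ in range(3):
        model.add(Dense(2800, activation='relu'))
        model.add(Dropout(0.1))
    model.add(Dense(607))
    model.compile(loss='mse', optimizer='sgd', context=context)
    return model

for ctx in (['gpu(0)'], ['gpu(%d)' % i for i in range(4)]):
    model = build(ctx)
    t0 = time.time()
    model.fit(X, Y, batch_size=128, nb_epoch=1, verbose=0)
    print(ctx, '%.1f s/epoch' % (time.time() - t0))

If the 4-GPU run is still slower on in-memory data, the bottleneck is compute/communication rather than I/O.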