Chinthaka
To get rid of the memory issue you run into when running ./train.sh, you have to either replace caffe/layers/cudnn_conv_layer.cpp and caffe/layers/cudnn_conv_layer.cu with the files provided in this repo and rebuild...
I think you are not using cuDNN 7.0 with the "engine: CAFFE" switch. Make sure you have installed cuDNN 7 correctly and that cudnn.so is linked to cudnn.so.7.0 or similar. Don't use "engine:...
@maxritter Did you find any solution? I hadn't changed 91 to 2 in the dump_tensorflow_weights.py file. After changing that, I get the error "cannot reshape array of size 157248 into shape...
@maxritter So you didn't try this MobileNetV2, but used the MobileNetV1 by chuanqi305 instead, right?
@baiboat @tringn please look at #22. I continued my work with MobileNetV1 and stopped at this error. Haven't tried the solutions in #22 yet.
@Ricardicus @ademeure Still hangs. Changing the NCCL stream to main_stream doesn't help. Will look further into the multi-stream usage and the cudaMemcpyAsync calls.
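For reference while I dig into that, here's a minimal sketch of the event-based ordering I'm checking between the NCCL stream and main_stream; function, buffer, and stream names are hypothetical, not the actual PR code:

```c
// Hypothetical sketch: ordering an NCCL all-reduce on a side stream against
// work queued on the main stream using CUDA events, so that a later
// cudaMemcpyAsync or optimizer kernel can't race with the reduction.
#include <stddef.h>
#include <cuda_runtime.h>
#include <nccl.h>

void allreduce_grads(float *grads, size_t count, ncclComm_t comm,
                     cudaStream_t main_stream, cudaStream_t nccl_stream) {
    cudaEvent_t backward_done, reduce_done;
    cudaEventCreateWithFlags(&backward_done, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&reduce_done, cudaEventDisableTiming);

    // NCCL stream waits for everything already queued on main_stream
    // (e.g. the backward kernels that produced `grads`).
    cudaEventRecord(backward_done, main_stream);
    cudaStreamWaitEvent(nccl_stream, backward_done, 0);

    // In-place sum of the gradients across ranks on the side stream.
    ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, nccl_stream);

    // main_stream waits for the reduction before the optimizer step or any
    // cudaMemcpyAsync that reads the reduced gradients.
    cudaEventRecord(reduce_done, nccl_stream);
    cudaStreamWaitEvent(main_stream, reduce_done, 0);

    cudaEventDestroy(backward_done);
    cudaEventDestroy(reduce_done);
}
```

If either wait is missing, the hang or a silent data race can show up depending on timing, which is why I want to audit every cross-stream edge.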
Found the bug. `common_start` always sets the GPU to idx 0 and doesn't take the multi-GPU config into account. Working on the fix. @pjj thanks for the analysis above. Also it's better to...
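Roughly, the fix looks like this; struct and field names below are illustrative stand-ins for the real multi-GPU config, not the exact code:

```c
// Illustrative sketch of the bug and the fix; MultiGpuConfig and its fields
// are hypothetical stand-ins for the real config struct.
#include <cuda_runtime.h>

typedef struct {
    int process_rank;      // rank of this process
    int num_processes;     // total number of ranks
    int local_device_idx;  // which GPU this rank should use on its node
} MultiGpuConfig;

void common_start(const MultiGpuConfig *cfg) {
    // Bug: cudaSetDevice(0) pins every rank to GPU 0, so all processes fight
    // over one device and the NCCL init/collectives can hang.
    // Fix: respect the per-rank device index from the config.
    cudaSetDevice(cfg->local_device_idx);
    // ... rest of the common setup (streams, handles, etc.) ...
}
```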
@karpathy Thank you for reviewing. Eager to take a look at 1) gradient accumulation and 2) gradient clipping to see if I can contribute. I refactored the PR to cater...
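For reference, a plain-C sketch of the two pieces as I understand them, not the PR's actual code: accumulate micro-batch gradients before the update, and scale the accumulated gradient down when its global L2 norm exceeds a threshold.

```c
// Sketch of gradient accumulation + clipping by global norm (host-side,
// single flat gradient buffer for brevity); names are illustrative.
#include <math.h>
#include <stddef.h>

// Accumulate one micro-batch gradient into the running gradient buffer.
void grad_accumulate(float *grad_acc, const float *grad_micro, size_t n) {
    for (size_t i = 0; i < n; i++) grad_acc[i] += grad_micro[i];
}

// Clip the accumulated gradient so its global L2 norm is at most max_norm.
void grad_clip_global_norm(float *grad, size_t n, float max_norm) {
    double sumsq = 0.0;
    for (size_t i = 0; i < n; i++) sumsq += (double)grad[i] * grad[i];
    float norm = (float)sqrt(sumsq);
    if (norm > max_norm) {
        float scale = max_norm / (norm + 1e-6f);
        for (size_t i = 0; i < n; i++) grad[i] *= scale;
    }
}
```

The accumulated gradient still needs to be divided by the number of micro-steps (either here or folded into the learning rate) before the optimizer update.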
@karpathy I’ve fixed the previous CI issues. Waiting until the multi-GPU hanging issue gets resolved.
@karpathy I added changes to shard the master weights and removed the previous, now unnecessary all-gather function for the master weights. ty.
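The idea, sketched below with made-up names rather than the actual code: each rank keeps only its 1/N slice of the fp32 master weights, updates that slice locally, and the updated values reach the other ranks through the all-gather of the model weights that already happens, so a separate all-gather of the fp32 masters is redundant.

```c
// Hypothetical sketch of sharded master weights (names assumed, not the real
// API). Each rank owns an equal slice of the fp32 masters, updates only that
// slice, and writes it into its slice of the model weights; a single
// all-gather of the model weights (ncclAllGather on device buffers in the
// real code) then propagates the update.
#include <stddef.h>

void sharded_update(float *master_shard,      // fp32 masters: this rank's slice only
                    float *params_full,       // model weights: full size on every rank
                    const float *grads_shard, // already-reduced grads for this slice
                    size_t shard_size, int rank, float lr) {
    // 1) Optimizer step touches only the local shard of the masters
    //    (plain SGD here for brevity).
    for (size_t i = 0; i < shard_size; i++) {
        master_shard[i] -= lr * grads_shard[i];
    }
    // 2) Copy the updated shard into this rank's slice of the model weights
    //    (in the real code this is a cast down to lower precision).
    float *params_shard = params_full + (size_t)rank * shard_size;
    for (size_t i = 0; i < shard_size; i++) {
        params_shard[i] = master_shard[i];
    }
    // 3) All ranks then all-gather params_full from their params_shard slices;
    //    no rank ever reads another rank's fp32 masters, so those stay sharded
    //    and no master-weight all-gather is needed.
}
```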