CaffeOnSpark
SocketCaffeNet UT should be enhanced
After carefully checking the codebase, it seems that the current SocketCaffeNet-related unit tests don't make much sense. I'm curious how you quickly and conveniently verify the correctness of the distributed training mode as the codebase evolves. Is there perhaps existing work on this outside of CaffeOnSpark?
Thanks in advance for any help :) @anfeng
Please explain why it doesn't "make sense". We will be happy to enhance it as needed.
Be aware that SocketCaffeNet is a low-level API invoked by CaffeOnSpark via JNI.
- https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeProcessor.scala#L76-L77
- https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-distri/src/main/cpp/jni/JniCaffeNet.cpp#L47-L64
I'm basically familiar with the CaffeOnSpark codebase and have been developing on it for several months. What I mean is: why not add a complete training test for socketnet with cluster_size >= 2, just like the one for localnet?
I agree that we should expand the unit tests to simulate distributed training using SocketCaffeNet.
@fanshiqing any interest in working on it? We will be happy to review your contributions.
Thanks! @anfeng Actually, in my case I have changed the native CaffeOnSpark framework code, and I now need to verify that my changes keep real distributed training working just as it does in native CaffeOnSpark. The basic test that uses LocalCaffeNet has passed; more complicated tests that simulate distributed training locally still need to be run and verified carefully. I have encountered some problems that have not been resolved yet.