SINGA-140: Fixed bug in CollectAll() function

Open raunaqabhyankar opened this issue 8 years ago • 8 comments

In SINGA_HOME/src/worker.cc, in the function "int Worker::CollectAll(int step, NeuralNet* net)", unrolled layers other than the first one should not collect parameters, because the unrolled copies share their parameters.

Previous: if (layer->partition_id() == id_)

Current: if (layer->partition_id() == id_ && layer->unroll_index() == 0)
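To illustrate the effect of the extra check, here is a minimal sketch, not the actual SINGA source. The Layer struct, its fields, and the CountCollectingLayers helper are hypothetical stand-ins; only the accessor names partition_id() and unroll_index() and the shape of the condition come from the change described above.

```cpp
#include <vector>

// Hypothetical stand-in for a SINGA layer. Only the two accessor names
// mirror the ones used in the patch; the fields are made up for the sketch.
struct Layer {
  int partition;
  int unroll;
  int partition_id() const { return partition; }
  int unroll_index() const { return unroll; }
};

// Counts how many layers would be selected to collect parameters for the
// worker with the given id. check_unroll switches between the previous
// condition (false) and the patched condition (true).
int CountCollectingLayers(const std::vector<Layer>& layers, int id,
                          bool check_unroll) {
  int count = 0;
  for (const Layer& layer : layers) {
    bool selected = layer.partition_id() == id;
    if (check_unroll) {
      // Patched condition: skip every unrolled copy except the first.
      selected = selected && layer.unroll_index() == 0;
    }
    if (selected) ++count;
  }
  return count;
}
```

For one layer unrolled three times in partition 0 plus one layer in partition 1, i.e. {{0, 0}, {0, 1}, {0, 2}, {1, 0}}, the previous condition selects 3 layers for worker 0, while the patched condition selects only 1.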

@kaiping

raunaqabhyankar avatar Mar 30 '16 06:03 raunaqabhyankar

Would you please change the commit message to follow this format "SINGA-xxx <JIRA Title>"? Have you tried to run the char-rnn example after this commit?

nudles avatar Mar 30 '16 09:03 nudles

I'll change the commit message. I haven't run the example. Could you please tell me how to do that? Thanks.

raunaqabhyankar avatar Mar 30 '16 10:03 raunaqabhyankar

here are the instructions: http://singa.apache.org/docs/general-rnn.html

nudles avatar Mar 30 '16 12:03 nudles

Dear sir, Hi! Could you please tell me what the steps for execution and the expected output should be? I went through (http://singa.apache.org/docs/general-rnn.html) but did not understand properly. Thanks... :)

raunaqabhyankar avatar Apr 04 '16 15:04 raunaqabhyankar

Have you tried to run the example? The instructions are similar to those of other examples (we have provided the job.conf file in examples/char-rnn). Pls paste your output here.

nudles avatar Apr 04 '16 15:04 nudles

Original Code (no changes):

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf -test
Unique JOB_ID is 18
Record job information to /tmp/singa-log/job-info/job-18-20160408-113927
Executing : ./singa -test -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 18 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0408 11:39:27.331846 6093 cluster.cc:50] proc #0 -> localhost (pid = 6093)
E0408 11:39:27.362449 6093 worker.cc:465] accuracy = nan, Loss = nan,

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 3
Record job information to /tmp/singa-log/job-info/job-3-20160404-225756
Executing : ./singa [-resume] -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 3 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0404 22:57:56.371260 6570 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 6570)
E0404 22:57:56.398120 6592 server.cc:62] Server (group = 0, id = 0) start
E0404 22:57:56.398223 6593 worker.cc:68] Worker (group = 0, id = 0) start on GPU 0
E0404 22:57:58.417470 6593 char_rnn.cc:52] Vocab_size = 81
E0404 22:57:58.417582 6593 char_rnn.cc:72] Max iteration per epoch = 1
F0404 22:57:58.418169 6593 math_blob.h:730] Not implemented
*** Check failure stack trace: ***
    @ 0x7f95f63377fd google::LogMessage::Fail()
    @ 0x7f95f633947d google::LogMessage::SendToLog()
    @ 0x7f95f63373e3 google::LogMessage::Flush()
    @ 0x7f95f6339eae google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f95f6b77b30 singa::BPTTWorker::Forward()
    @ 0x7f95f6b6fa0a singa::BPWorker::TrainOneBatch()
    @ 0x7f95f6b79e29 singa::Worker::Run()
    @ 0x7f95f55d8f30 (unknown)
    @ 0x7f95f4df160a start_thread
    @ 0x7f95f4b2ba4d __clone
./bin/singa-run.sh: line 109: 6570 Aborted (core dumped) $singa_run

Changed Code:

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf -test
Unique JOB_ID is 19
Record job information to /tmp/singa-log/job-info/job-19-20160408-114146
Executing : ./singa -test -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 19 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0408 11:41:46.785352 6237 cluster.cc:50] proc #0 -> localhost (pid = 6237)
E0408 11:41:46.809041 6237 worker.cc:465] accuracy = nan, Loss = nan,

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 4
Record job information to /tmp/singa-log/job-info/job-4-20160404-225906
Executing : ./singa [-resume] -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 4 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0404 22:59:06.511059 6839 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 6839)
E0404 22:59:06.537554 6861 server.cc:62] Server (group = 0, id = 0) start
E0404 22:59:06.537652 6862 worker.cc:68] Worker (group = 0, id = 0) start on GPU 0
E0404 22:59:08.574076 6862 char_rnn.cc:52] Vocab_size = 81
E0404 22:59:08.574199 6862 char_rnn.cc:72] Max iteration per epoch = 1
F0404 22:59:08.574826 6862 math_blob.h:730] Not implemented
*** Check failure stack trace: ***
    @ 0x7fade34d07fd google::LogMessage::Fail()
    @ 0x7fade34d247d google::LogMessage::SendToLog()
    @ 0x7fade34d03e3 google::LogMessage::Flush()
    @ 0x7fade34d2eae google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fade3d10b30 singa::BPTTWorker::Forward()
    @ 0x7fade3d08a0a singa::BPWorker::TrainOneBatch()
    @ 0x7fade3d12e29 singa::Worker::Run()
    @ 0x7fade2771f30 (unknown)
    @ 0x7fade1f8a60a start_thread
    @ 0x7fade1cc4a4d __clone
./bin/singa-run.sh: line 109: 6839 Aborted (core dumped) $singa_run

@nudles This is the output. Before and after changes were made.

raunaqabhyankar avatar Apr 04 '16 17:04 raunaqabhyankar

Hi, pls compile SINGA with CUDA enabled.

./configure --enable-cuda --with-cuda=<cuda folder path>
make

If you do not have a GPU (or CUDA), then comment out this line in job.conf:

#gpu: 0

nudles avatar Apr 16 '16 08:04 nudles

Hey, thanks for the tip! Here's the output.

Original code

[abhyankar@dhcppc4 incubator-singa]$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 4
Record job information to /tmp/singa-log/job-info/job-4-20160416-174208
Executing : ./singa -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 4 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0416 17:42:08.750080  3629 cluster.cc:50] proc #0 -> 0.0.0.0:49153 (pid = 3629)
E0416 17:42:08.776180  3651 server.cc:62] Server (group = 0, id = 0) start
E0416 17:42:08.776283  3652 worker.cc:68] Worker (group = 0, id = 0)  start on CPU
E0416 17:42:10.810894  3652 char_rnn.cc:52] Vocab_size = 81
E0416 17:42:10.811003  3652 char_rnn.cc:72] Max iteration per epoch = 1
E0416 17:42:11.357823  3652 worker.cc:465] Train @ step 0 accuracy = 0.120000, Loss = 230.064392, 
E0416 17:43:07.767719  3652 worker.cc:465] Train @ step 100 accuracy = 3.989800, Loss = 188.168106, 
E0416 17:44:03.478979  3652 worker.cc:465] Train @ step 200 accuracy = 4.135199, Loss = 183.716354, 
E0416 17:45:03.002893  3652 worker.cc:465] Train @ step 300 accuracy = 4.773601, Loss = 178.245834, 
^Z
[2]+  Stopped                 ./bin/singa-run.sh -conf examples/char-rnn/job.conf

Changed code

[abhyankar@dhcppc4 incubator-singa]$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 3
Record job information to /tmp/singa-log/job-info/job-3-20160416-173813
Executing : ./singa -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 3 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0416 17:38:14.131456  3411 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 3411)
E0416 17:38:14.147335  3433 server.cc:62] Server (group = 0, id = 0) start
E0416 17:38:14.147336  3434 worker.cc:68] Worker (group = 0, id = 0)  start on CPU
E0416 17:38:15.256013  3434 char_rnn.cc:52] Vocab_size = 81
E0416 17:38:15.265971  3434 char_rnn.cc:72] Max iteration per epoch = 1
E0416 17:38:15.834771  3434 worker.cc:465] Train @ step 0 accuracy = 0.080000, Loss = 230.700241, 
E0416 17:39:12.429210  3434 worker.cc:465] Train @ step 100 accuracy = 3.935000, Loss = 188.156631, 
E0416 17:40:08.664752  3434 worker.cc:465] Train @ step 200 accuracy = 4.251200, Loss = 183.603928, 
E0416 17:41:04.237298  3434 worker.cc:465] Train @ step 300 accuracy = 5.384400, Loss = 177.437698, 
^Z
[1]+  Stopped                 ./bin/singa-run.sh -conf examples/char-rnn/job.conf

@nudles

raunaqabhyankar avatar Apr 16 '16 12:04 raunaqabhyankar