
Simulation with ResNet fails

Open daecheolyou opened this issue 4 years ago • 7 comments

During simulation with ResNet, gem5 crashes with a segmentation fault. I created the ResNet pb and pbtxt files by running smaug/experiments/models/imagenet-resnet/resnet_network.py. All configuration files are the same as in the minerva example; only model_files was modified so that it points to the generated pb and pbtxt files. The input trace was generated by running trace.sh.

Below is the end of the stdout log.

Scheduling data (Data).
Scheduling data_1 (Data).
Scheduling data_10 (Data).
Scheduling data_100 (Data).
Scheduling data_101 (Data).
Scheduling data_102 (Data).
Scheduling data_103 (Data).
Scheduling data_104 (Data).
Scheduling data_105 (Data).
Scheduling data_106 (Data).
Scheduling data_107 (Data).
Scheduling data_108 (Data).
Scheduling data_109 (Data).

The stderr log shows the following message before the backtrace.

gem5 has encountered a segmentation fault!

Please let me know if I configured something wrong. Thanks.

daecheolyou avatar Oct 20 '21 08:10 daecheolyou

Yuan, can you take a look at this?

xyzsam avatar Oct 21 '21 03:10 xyzsam

Yes, will take a look this week.

yaoyuannnn avatar Oct 21 '21 03:10 yaoyuannnn

Just a guess: did you update trace_file_name in gem5.cfg to use the correct trace file?

yaoyuannnn avatar Oct 26 '21 04:10 yaoyuannnn

It doesn't need to be modified. Only model_files was changed, so that it points to the pbtxt and pb files under imagenet-resnet. The trace file was generated with trace.sh, which takes model_files as input and always names its output dynamic_trace_acc0.gz.

daecheolyou avatar Oct 26 '21 04:10 daecheolyou
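
The trace_file_name question above can be checked mechanically. The following is a minimal sketch, assuming gem5.cfg is an INI-style file that Python's configparser can read and that trace_file_name appears as a plain key inside one or more sections. The section name `acc0` in the usage note and the helper names `trace_files` and `missing_traces` are hypothetical, and relative paths are resolved against the directory containing gem5.cfg, which only matches the real behavior if the simulation is launched from that directory.

```python
import configparser
import os


def trace_files(cfg_text):
    """Return {section: trace_file_name} for every section that sets the key."""
    # interpolation=None keeps raw values, so a literal '%' in the file
    # does not raise an interpolation error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        name: parser[name]["trace_file_name"]
        for name in parser.sections()
        if "trace_file_name" in parser[name]
    }


def missing_traces(cfg_path):
    """Return the entries whose trace file does not exist on disk.

    Assumption: relative paths are relative to the directory of cfg_path.
    """
    base = os.path.dirname(os.path.abspath(cfg_path))
    with open(cfg_path) as f:
        found = trace_files(f.read())
    return {
        section: path
        for section, path in found.items()
        if not os.path.exists(os.path.join(base, path))
    }
```

Calling `missing_traces("gem5.cfg")` returns an empty dict when every referenced trace is present; a non-empty result such as `{"acc0": "./dynamic_trace_acc0.gz"}` would point at a stale or misnamed trace.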

I just tried running resnet50. It's still running, but it has started running the accelerator for the first convolution layer (conv0), which is clearly past the point where your simulation crashed. To reduce the trace size for this relatively large network, the only change I made was using --sample-level=very_high in trace.sh (and the same in run.sh). Other than updating the protobuf inputs, the rest of the configuration files are the same as the ones in sims/smv/tests/minerva.

yaoyuannnn avatar Oct 26 '21 05:10 yaoyuannnn

Did the simulator leave any stack traces indicating where the segfault occurred?

xyzsam avatar Oct 26 '21 07:10 xyzsam

Below is the stack trace for the simulation failure. I ran the simulation several times with ResNet, and sometimes it got further than the log I originally posted. For example, it has reached Scheduling relu2_b (ReLU). However, it eventually hit a segmentation fault with the same kind of stack trace as below.

/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z15print_backtracev+0x2c)[0x55a3fb5e722c]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x6e92ff)[0x55a3fb5f92ff]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f8073fc9890]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0xcf)[0x7f80725f6d9f]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder10decodeInstENS_11ExtMachInstE+0x2e6c1)[0x55a3fc00f141]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder6decodeENS_11ExtMachInstEm+0x244)[0x55a3fbfa88f4]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder6decodeERNS_7PCStateE+0x22b)[0x55a3fbfa8beb]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN12DefaultFetchI9O3CPUImplE5fetchERb+0x979)[0x55a3fbb0eb69]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN12DefaultFetchI9O3CPUImplE4tickEv+0xd3)[0x55a3fbb0fe23]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN9FullO3CPUI9O3CPUImplE4tickEv+0x12b)[0x55a3fbaedb3b]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN10EventQueue10serviceOneEv+0xd9)[0x55a3fb5ef709]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z9doSimLoopP10EventQueue+0x148)[0x55a3fb610e28]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z8simulatem+0xcba)[0x55a3fb611dda]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x7bf6d1)[0x55a3fb6cf6d1]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x5e8754)[0x55a3fb4f8754]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x64d7)[0x7f8074276c47]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f80742705d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ac0)[0x7f8074277230]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f80742705d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x76)[0x7f80743206f6]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z6m5MainiPPc+0x83)[0x55a3fb5f8013]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(main+0x38)[0x55a3fb448e08]

daecheolyou avatar Oct 26 '21 07:10 daecheolyou
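
A backtrace like the one above is easier to read once it is split into frames and the mangled C++ symbols are isolated. The following is a minimal sketch, assuming every frame has the `binary(symbol+0xoffset)[0xaddress]` shape seen in this log. The names `FRAME` and `split_frames` are hypothetical helpers, not part of gem5 or SMAUG.

```python
import re

# One frame looks like: /path/to/binary(symbol+0xOFFSET)[0xADDRESS]
# The symbol may be empty when the binary has no name for that address.
FRAME = re.compile(r"([^\s(]+)\(([^()+]*)\+0x([0-9a-f]+)\)\[0x([0-9a-f]+)\]")


def split_frames(text):
    """Split a whitespace-separated backtrace into a list of frame dicts."""
    return [
        {
            "binary": m.group(1),
            "symbol": m.group(2) or None,   # None for unnamed frames
            "offset": int(m.group(3), 16),
            "address": int(m.group(4), 16),
        }
        for m in FRAME.finditer(text)
    ]


def named_symbols(text):
    """Return the mangled symbol names in call order, skipping unnamed frames."""
    return [f["symbol"] for f in split_frames(text) if f["symbol"]]
```

The output of `named_symbols` can be written one name per line and passed through a demangler such as `c++filt` to get readable function names.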