
std::bad_alloc issue with Ubuntu18.04

Open pramenku opened this issue 7 years ago • 22 comments

I tried miopen-benchmark on Ubuntu 18.04. The tests built fine, but running them fails with "std::bad_alloc":

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

The issue comes from miopen-benchmark's header file, where it tries to construct a directory path:

Thread 1 "alexnet" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff2779801 in __GI_abort () at abort.c:79
#2  0x00007ffff2dce8fb in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff2dd4d3a in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff2dd4d95 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff2dd4fe8 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff2dfdf26 in std::__throw_bad_alloc() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000000000463ced in __gnu_cxx::new_allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> >::allocate
    (__n=12297829382473034424, this=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/ext/new_allocator.h:102
#8  std::allocator_traits<std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::allocate (
    __n=12297829382473034424, __a=...) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/alloc_traits.h:436
#9  std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_M_allocate (__n=12297829382473034424, this=<optimized out>)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:172
#10 std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_M_create_storage (__n=12297829382473034424, this=<optimized out>)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:187
#11 std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_Vector_base (__n=12297829382473034424, this=<optimized out>, __a=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:138
#12 std::vector<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::vector (__n=12297829382473034424, this=<optimized out>, __a=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:284
#13 std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Executor (__begin=..., __end=..., 
    __results=..., __re=..., __flags=<optimized out>, this=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex_executor.h:79
#14 std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, true> (__s=118 'v', __e=0 '\000', __m=..., __re=..., __flags=(unknown: 0)) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.tcc:78
#15 0x00000000004589d7 in std::regex_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> > (
    __s=<error reading variable: Cannot access memory at address 0x2>, __e=0 '\000', __m=..., __re=..., __flags=(unknown: 0))
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:1995
#16 std::regex_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char, std::__cxx11::regex_traits<char> > (
    __first=<error reading variable: Cannot access memory at address 0x2>, __last=0 '\000', __re=..., __flags=(unknown: 0))
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:2022
#17 std::regex_match<std::char_traits<char>, std::allocator<char>, char, std::__cxx11::regex_traits<char> > (__re=..., __flags=(unknown: 0), __s=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:2127
#18 ls_dir (dname=..., match=...) at ./miopen.hpp:142
#19 0x000000000045b41f in Device::init_sys_paths (this=0xc8f270) at ./miopen.hpp:175
#20 0x00000000004590c3 in Devices::init_devices () at ./miopen.hpp:289
#21 0x000000000045b2b6 in device_init () at ./miopen.hpp:299
#22 main (argc=2, argv=0x7fffffffdb20) at alexnet.cpp:60

This may be relevant: https://stackoverflow.com/questions/36106154/how-to-handle-or-avoid-exceptions-from-c11-regex-matching-functions-28-11
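
In the spirit of the linked question, one defensive option is to wrap the match so that an exception from the regex engine is reported instead of reaching std::terminate. The sketch below is only an illustration: `MatchStatus` and `try_regex_match` are hypothetical names and are not part of miopen.hpp. Catching the exception localizes the failure; it does not explain or fix it.

```cpp
#include <new>
#include <regex>
#include <string>

// Hypothetical helper (not part of miopen.hpp): outcome of a guarded match.
enum class MatchStatus { Matched, NotMatched, Failed };

// Runs std::regex_match and converts exceptions into a status value.
MatchStatus try_regex_match(const std::string& text, const std::regex& re) {
    try {
        return std::regex_match(text, re) ? MatchStatus::Matched
                                          : MatchStatus::NotMatched;
    } catch (const std::regex_error&) {
        return MatchStatus::Failed;  // thrown by the regex engine itself
    } catch (const std::bad_alloc&) {
        return MatchStatus::Failed;  // allocation failure, as in the trace above
    }
}
```

A caller could log the file name whenever the status is `Failed`, which would show which directory entry triggers the problem.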

Thanks

pramenku avatar Jul 18 '18 05:07 pramenku

I've had this same error before too. I assumed it was just something odd with my setup (I'm using RHEL7, which is not as well supported), but it appears not. The issue is with the regex calls. Specifically, it seems different OSes (or versions of an OS) implement the regex support differently. For example, I changed this line:

https://github.com/patflick/miopen-benchmark/blob/master/miopen.hpp#L175

to use this:

std::regex("card(\\d)+")

(I also applied similar fixes elsewhere in miopen.hpp, but this seems to be the line you're having a problem with)

Not sure if the same fix works for you?

Hope this helps, Matt

mattsinc avatar Jul 18 '18 05:07 mattsinc

Hi Matt, miopen.hpp already has that at L175.

But we are still seeing the regex issue. It doesn't help on Ubuntu 18.04.

Thanks,

pramenku avatar Jul 19 '18 04:07 pramenku

Just in case we're miscommunicating, the change I'm proposing for line 175 is small and based on your response it seems like you may have thought it was identical to what is already there. Currently it is:

std::regex("card\\d+")

I changed it to this:

std::regex("card(\\d)+")

(I just put parentheses around the \\d ... I found this was necessary on RHEL7)

Matt

mattsinc avatar Jul 19 '18 05:07 mattsinc

I made the same change as you suggested, but it's still not working.

for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card(\\d)+")))

The "\" is not showing up after the comment is posted, but I used "\" only.

pramenku avatar Jul 19 '18 12:07 pramenku

Yeah, the "\" doesn't show up unless you use the code feature (put "`" around the code part).

Sorry my fix didn't work for you. I will say that I played around with a bunch of the C++ regex options before settling on that. Perhaps one of the others will solve your problem?

Also, when I was making those changes, I broke apart line 175 so I could run with gdb and figure out exactly what is failing. I suggest you do the same.

EDIT: One last thing: did you use 1 or 2 backslashes in the above? I used 2, but it seems like you may have used one based on what you said.

Hope this helps, Matt

mattsinc avatar Jul 19 '18 16:07 mattsinc

My apologies for the late reply.

Your stack trace points to something very odd. Somewhere in regex_match, it tries to allocate a std::vector of size 12297829382473034424. I can't reproduce this, nor can I imagine why it would happen. Putting parentheses around the \\d should not change the regex match; it just creates a capture for the card number. Also, technically it should then be "card(\\d+)" (note the + inside the capture).
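
To illustrate the point about captures, here is a small sketch. `first_capture` is a hypothetical helper for illustration, not code from the repository. With the ECMAScript grammar (the std::regex default), both patterns fully match the same names; only the captured text differs, because a repeated group keeps its last iteration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of a full match, or "" if none.
std::string first_capture(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_match(text, m, std::regex(pattern)) && m.size() > 1) {
        return m[1].str();
    }
    return "";
}
```

For the name card12, first_capture("card12", "card(\\d)+") yields "2", while first_capture("card12", "card(\\d+)") yields "12".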

Could you try to print out the fname's just prior to the regex match in https://github.com/patflick/miopen-benchmark/blob/master/miopen.hpp#L142 (insert an INFO(fname); prior to that line), and see at which file name it fails with the regex error?

patflick avatar Jul 19 '18 17:07 patflick

Thanks, patflick. Sorry for the delayed response. I tried as suggested and got the following:

$ ./layerwise
[INFO]  Number of HIP devices found: 2
[INFO]  card1-DP-6
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

    while ((entry = readdir(dir)) != NULL) {
        std::string fname(entry->d_name);
        if (fname != "." && fname != "..") {
            INFO(fname);
            if (std::regex_match(fname, match)) {
                files.push_back(fname);

    void init_sys_paths() {
        bool found = false;
        for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card(\\d+)"))) {
            std::string carddir = "/sys/class/drm/" + cardname;
            std::string fname = carddir + "/device/uevent";

pramenku avatar Jul 23 '18 10:07 pramenku

They seem to have changed the directory structure / folder naming scheme for the sysfs driver api.

I still don't know why the regex would crash, but the regex card(\\d+) won't match the folder name card1-DP-6. When I first coded this, the cards/devices were named card0, card1, etc.

To check the folder structure, can you run tree in /sys/class/drm/ ?

Also, try changing the regex from card(\\d+) to card(\\d+).*. That should allow the code to match the card1-DP-6 folder.
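
The suffix matters because std::regex_match succeeds only if the pattern covers the entire string. A minimal sketch, with `matches` as a hypothetical wrapper for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: true only if `pattern` matches the whole of `name`.
bool matches(const std::string& name, const std::string& pattern) {
    return std::regex_match(name, std::regex(pattern));
}
```

With this, matches("card1-DP-6", "card(\\d+)") is false and matches("card1-DP-6", "card(\\d+).*") is true. The relaxed pattern also still matches a plain name such as card1, since .* can match nothing.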

patflick avatar Jul 27 '18 04:07 patflick

Thanks. I tried that, but still no luck.

$ ./alexnet
[INFO]  Number of HIP devices found: 1
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

    for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card\\d+).*."))) {

        if (fname != "." && fname != "..") {
            INFO(fname);
            if (std::regex_match(fname, match)) {
                files.push_back(fname);

/sys/class/drm$ ls -lrt
total 0
-r--r--r-- 1 root root 4096 Jul 30 12:43 version
lrwxrwxrwx 1 root root    0 Jul 30 12:43 ttm -> ../../devices/virtual/drm/ttm
lrwxrwxrwx 1 root root    0 Jul 30 12:43 renderD128 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/renderD128
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-HDMI-A-1 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-HDMI-A-1
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-3 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-3
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-2 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-2
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-1 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-1
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0
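
The std::regex_error in this run is consistent with the pattern as posted: card\\d+).*. contains a closing parenthesis with no opening one, so the std::regex constructor would be expected to reject it before any directory entry is matched (no file name was logged before the abort). A small sketch for checking a pattern up front follows; `is_valid_pattern` is a hypothetical helper, not part of miopen.hpp.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` compiles as a std::regex.
bool is_valid_pattern(const std::string& pattern) {
    try {
        std::regex re(pattern);
        (void)re;  // only construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;  // malformed pattern, e.g. unbalanced parentheses
    }
}
```

Checked this way, the pattern as posted is rejected, while card(\\d+).* compiles.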

Have you ever tried this on Ubuntu 18.04?

pramenku avatar Jul 30 '18 09:07 pramenku

Any more suggestions, or is someone looking into it? Is it possible to just not use regex and instead roll your own parser?

pramenku avatar Aug 14 '18 08:08 pramenku

Hi patflick, can you please resolve this issue? It has been pending for a long time. All end users are seeing this issue on Ubuntu 18.04. Thanks for the help.

pramenku avatar Sep 17 '18 10:09 pramenku

Hi @pramenku

I could never reproduce your error.

I just pushed a code change that might help, although I'm really just guessing. If this doesn't work, your best bet is to try to debug this yourself. Sorry

patflick avatar Sep 17 '18 15:09 patflick

Thanks @patflick, I will try it and update you. In the meantime, can you please merge the PR https://github.com/patflick/miopen-benchmark/pull/12?

pramenku avatar Sep 25 '18 04:09 pramenku

Hi @pramenku . I merged the PR. Did you get a chance to try the potential fix?

patflick avatar Sep 25 '18 16:09 patflick

Sorry for the delay, @patflick. I was tied up with priority tasks. I will update you by tomorrow without fail.

pramenku avatar Sep 26 '18 14:09 pramenku

Hi @patflick, I tried with the latest changes as well, and the issue is still observed. With ROCm release 1.9 and Ubuntu 18.04, we are clearly seeing the issue. I am surprised that no one else is seeing it. Ideally speaking, everyone should see this issue.

pramenku avatar Sep 27 '18 09:09 pramenku

Hi @pramenku,

I think I understand your problem now. Unfortunately I don't know if there is a happy solution. To the best of my knowledge, ROCm does not yet support Ubuntu 18.04. I believe the specific problem you are encountering is that 18.04 has gcc/g++ 7.2 as the "default" gcc/g++, but ROCm needs gcc/g++ 5.4. Have you tried installing 5.4 locally and pointing to that instead?

Matt

mattsinc avatar Sep 27 '18 14:09 mattsinc

Thanks @mattsinc. That is exactly what I wanted to convey. There is no issue on Ubuntu 16.04, which has gcc 5.4. The issue is with Ubuntu 18.04, which has gcc 7.2.

With ROCm release 1.9, Ubuntu 18.04 is also supported. Please check https://github.com/RadeonOpenCompute/ROCm/blob/master/README.md.

So anyone trying 18.04 on ROCm 1.9 will see this issue.

pramenku avatar Sep 28 '18 16:09 pramenku

@pramenku, I did not realize that ROCm 1.9 had that support. If you use gcc 5.4 with ROCm 1.9 does it work? If so, I would guess the problem is a ROCm problem?

Matt

mattsinc avatar Sep 28 '18 19:09 mattsinc

@mattsinc It's not an issue with ROCm 1.9. It's just that the source code of the app needs to be modified for Ubuntu 18.04, which has gcc 7.2. I am not sure whether CUDA supports Ubuntu 18.04, but I suspect the issue would show up there too. I request that someone try to debug what needs to be changed for Ubuntu 18.04.

pramenku avatar Sep 29 '18 10:09 pramenku

Hi @patflick and all

It has been about 2 months since the last discussion, but I encountered the same issue with the tip of the code. My environment is Ubuntu 16.04 + manually installed gcc-7.3.0.

To narrow it down, I wrote a very simple example:

#include <regex>
#include <string>
#include <iostream>

int main(){
	std::string fname("amdgpu");
	std::regex card_re("card\\d+");
	bool result = std::regex_match(fname, card_re);

	std::cout<<result<<std::endl;
}

Naming the above code main.cc, I ran several tests:

  1. Command: /opt/rocm/hip/bin/hipcc main.cc, then run a.out: no problem.
  2. Command: /opt/rocm/hip/bin/hipcc -O3 main.cc, then run a.out: std::bad_alloc occurs.
  3. Compile with gcc-7.3.0: no problem with either -O3 or the default.

To be concrete, I can reproduce this regex issue with hipcc and the -O3 flag.

So I'm curious whether the hipcc compiler has a compatibility issue with gcc-7.3.0. Or maybe @pramenku can help test in an Ubuntu 18.04 environment?

Below is my hipcc info (/opt/rocm/bin/hipcc --version):

HIP version: 1.5.18442
HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 4ed1d60af7c26e833d6d4452ba526d2daaa6ed35) (ssh://gerritgit/compute/ec/hcc-tot/llvm c57b310200941724972aa5c5c90cbc151d1978f4) (based on HCC 1.2.18451-82f39f1-4ed1d60-c57b310 )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin

carlushuang avatar Nov 27 '18 06:11 carlushuang

https://github.com/patflick/miopen-benchmark/pull/13

@patflick Hi, I made a quick workaround for this issue that does not use regex to check the device name. If it's not acceptable, just drop it.
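
For readers who want the idea without opening the PR, a regex-free check for names of the form card followed by digits could look like the sketch below. This is an illustration under that assumption and not the code from the PR; `is_card_name` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true for "card" followed by one or more decimal digits,
// i.e. the same names the pattern card\d+ would fully match.
bool is_card_name(const std::string& name) {
    const std::string prefix = "card";
    if (name.size() <= prefix.size() ||
        name.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    for (std::size_t i = prefix.size(); i < name.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(name[i]))) {
            return false;
        }
    }
    return true;
}
```

Such a check accepts card0 and rejects connector entries like card0-DP-1 as well as renderD128, without touching the regex engine at all.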

carlushuang avatar Nov 27 '18 07:11 carlushuang