Carla segfaults within 127 Episodes

Open hh0rva1h opened this issue 5 years ago • 22 comments

Environment

The following was tested on Ubuntu with Carla 0.9.9.4, started using CarlaUE4.sh -opengl -quality-level=Low.

Trigger Code

Consider the following proof-of-concept code. It does not make much logical sense, because it was stripped down from an old version of our reinforcement learning infrastructure into a minimal reproducer that reliably makes Carla misbehave and crash:

import carla

def update_data(self, data):
    self.data = data

def spawn_and_destroy():
    client = carla.Client("localhost", 2000)
    client.set_timeout(3)
    world = client.load_world("Town04")
    blueprints = world.get_blueprint_library()

    settings = world.get_settings()
    settings.fixed_delta_seconds = 0.05
    settings.synchronous_mode = True
    settings.no_rendering_mode = True
    world.apply_settings(settings)

    car = blueprints.find("vehicle.tesla.model3")
    position = carla.Transform(carla.Location(x=300, y=13.5, z=2), carla.Rotation(yaw=180))
    vehicle = world.spawn_actor(car, position)
    collision_actor = world.spawn_actor(blueprints.find("sensor.other.collision"),
                                                       carla.Transform(carla.Location()),
                                                       attach_to=vehicle)
    collision_actor.listen(update_data)
    lane_actor = world.spawn_actor(blueprints.find("sensor.other.lane_invasion"),
                                                  carla.Transform(),  # carla.Location()),
                                                  attach_to=vehicle)
    vehicle.apply_control(carla.VehicleControl(hand_brake=1))
    lane_actor.listen(update_data)

    for i in range(10):
        try:
            world.tick(1)
        except:
            print("WARNING: tick timed out, continuing ...")

    vehicle.destroy()
    collision_actor.destroy()
    lane_actor.destroy()

for i in range(150):
    print("Episode", i)
    spawn_and_destroy()

Starting Carla and then executing this file reliably produces problems within 127 episodes here, though the behavior is not always exactly the same; see the following sections.

Crash variant 1

After 127 Episodes the script ends:

Episode 127
Traceback (most recent call last):
  File "carla-segfault-min.py", line 87, in <module>
    spawn_and_destroy()
  File "carla-segfault-min.py", line 51, in spawn_and_destroy
    client = carla.Client("localhost", 2000)
RuntimeError: resolve: Device or resource busy

and Carla crashes with the following message:

4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=140864 LargeMemoryPoolOffset=272000 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

Crash variant 2

After Episode 127 the script stops:

Episode 127
Traceback (most recent call last):
  File "carla-segfault-min.py", line 55, in <module>
    spawn_and_destroy()
  File "carla-segfault-min.py", line 19, in spawn_and_destroy
    client = carla.Client("localhost", 2000)
RuntimeError: resolve: Device or resource busy

and Carla crashes with the following message:

4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
LowLevelFatalError [File:Unknown] [Line: 102] 
Exception thrown: close: Bad file descriptor
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=140864 LargeMemoryPoolOffset=272000 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

Crash variant 3

I tested the above script multiple times, and on one occasion Carla crashed after only 27 episodes:

Episode 27
Traceback (most recent call last):
  File "carla-segfault-min.py", line 87, in <module>
    spawn_and_destroy()
  File "carla-segfault-min.py", line 53, in spawn_and_destroy
    world = client.load_world("Town04")
RuntimeError: failed to connect to newly created map

with the following crash message:

4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
terminating with uncaught exception of type std::__1::bad_weak_ptr: bad_weak_ptrCommonUnixCrashHandler: Signal=11

Signal 6 caught.
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=140864 LargeMemoryPoolOffset=272000 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

Crash variant 4

Episode 38
Traceback (most recent call last):
  File "carla-segfault-min.py", line 55, in <module>
    spawn_and_destroy()
  File "carla-segfault-min.py", line 21, in spawn_and_destroy
    world = client.load_world("Town04")
RuntimeError: failed to connect to newly created map
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=140864 LargeMemoryPoolOffset=272000 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

hh0rva1h avatar Aug 17 '20 20:08 hh0rva1h

related to #3211?

marcgpuig avatar Aug 21 '20 14:08 marcgpuig

@marcgpuig I can't really test that, since the Vulkan backend does not work for me on two different machines: my Ivy Bridge workstation is too old, and my Intel laptop (Dell XPS 13 9360, Ubuntu) just freezes with the Vulkan backend.

hh0rva1h avatar Aug 21 '20 15:08 hh0rva1h

I tested it multiple times with and without -opengl and could not see any difference in behavior: I consistently get the crash in "episode" 127. The only difference seems to be that with -opengl Carla sometimes keeps running and only the Python client crashes, but that occurred just twice in all my tests, so it may be coincidental. Interestingly, I get a different error message, which seems to imply that the map was loaded too often. Since it also consistently occurs at "episode" 127, I think it is probably still the same type of crash.

Output:

[...]
Episode 127
Traceback (most recent call last):
  File "carla-segfault.py", line 55, in <module>
    spawn_and_destroy()
  File "carla-segfault.py", line 21, in spawn_and_destroy
    world = client.load_world("Town04")
RuntimeError: epoll: Too many open files
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 72, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 23, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 35, in <module>
    apt_pkg.init_system()
apt_pkg.Error: E:Error reading the Tuple table

Original exception was:
Traceback (most recent call last):
  File "carla-segfault.py", line 55, in <module>
    spawn_and_destroy()
  File "carla-segfault.py", line 21, in spawn_and_destroy
    world = client.load_world("Town04")
RuntimeError: epoll: Too many open files

syveqc avatar Aug 22 '20 18:08 syveqc

@doterop @bernatx @marcgpuig this issue just got the category of "critical". Please, let's dig in.

germanros1987 avatar Aug 29 '20 01:08 germanros1987

Still an issue with the nightly build.

hh0rva1h avatar Oct 11 '20 15:10 hh0rva1h

I encountered the same issue in 0.9.10.1, but this issue doesn't show in 0.9.9.4.

gaoyinfeng avatar Nov 16 '20 09:11 gaoyinfeng

Any solutions or workarounds?

yasser-h-khalil avatar Nov 24 '20 00:11 yasser-h-khalil

Our tests suggested that the only issue is the reloading of the map, so we simply do not reload it; we reset our scenarios in the current world instead. I do not know, however, whether that is feasible for you!
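
A minimal sketch of this workaround, assuming a single client is created once and reused for every episode. `get_or_load_world` is a hypothetical helper, not part of the CARLA API; it only relies on `client.get_world()`, `world.get_map().name` and `client.load_world()`. Depending on the CARLA version, the map name may be reported as a full asset path, so only its last path component is compared here.

```python
def get_or_load_world(client, town):
    """Return the current world if it already runs `town`; reload only otherwise.

    `client` is expected to behave like carla.Client.
    Hypothetical helper for illustration, not part of the CARLA API.
    """
    world = client.get_world()
    # The map name may look like ".../Town04" or just "Town04",
    # so compare only the last path component.
    current = world.get_map().name.split("/")[-1]
    if current == town:
        return world
    return client.load_world(town)


# Usage (assumes a running CARLA server):
#   import carla
#   client = carla.Client("localhost", 2000)  # create once, outside the episode loop
#   client.set_timeout(10.0)
#   world = get_or_load_world(client, "Town04")
#   for episode in range(150):
#       ...  # spawn actors, run the episode, destroy actors; no load_world() here
```

With this shape, `load_world` is called at most once per map change instead of once per episode.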

syveqc avatar Nov 24 '20 08:11 syveqc

We are seeing Crash variant 3 seemingly randomly in 0.9.8+OpenGL, especially on slower systems... we are thinking it is a race condition.

qhaas avatar Mar 28 '21 23:03 qhaas

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 11 '21 02:06 stale[bot]

Also encountered this issue on CARLA 0.9.11, any update here?

AIasd avatar Aug 09 '21 02:08 AIasd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 09 '22 10:01 stale[bot]

Same with 0.9.13. Any updates please?

varunjammula avatar Jan 30 '22 00:01 varunjammula

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 12:04 stale[bot]

@bernatx @doterop This has been labelled critical for quite some time. If it were not repeatedly bumped, the stale bot would just close it, and none of the devs would even care since it would be closed? This does not feel right, to be honest. Does the project need more financial support, or is it a matter of different priorities?

hh0rva1h avatar Apr 29 '22 09:04 hh0rva1h

I might have a client-side workaround that avoids crashing the server. I'm using Carla 0.9.13.

I started using Carla by following the client implementation in https://github.com/cjy1992/gym-carla/blob/master/gym_carla/envs/carla_env.py.

As the code evolved, I made some changes that seemed to make the server crash more frequently (I can't recall which change, though). I suspected that switching the server between async and sync mode was the cause, because the client-side timeout exception was usually raised when applying the settings that switch between the two modes. I changed the mode inside the reset function, so it was switched back and forth every episode. (I had been doing this since the beginning, though, so it probably only rarely crashes the server.)

So I tried to solve it by using Carla only in sync mode and ended up with code like this.

class CarlaEnv:
    ...

    def reset(self):
        # stop sensors
        self.world.tick()

        # destroy sensors and ego vehicle
        self.client.apply_batch_sync(..., due_tick_cue=True)

        # spawn ego vehicle
        self.world.tick()

        # spawn a sensor
        self.world.tick()

        # spawn another sensor
        self.world.tick()

    def step(self, action):
        # apply action
        self.world.tick()

In summary, I called world.tick a lot because I thought this was the right way to do it. (Previously, the spawning process happened while the server was in async mode.)

Unfortunately, it turned out to crash even sooner than before: instead of every few hundred episodes, it crashed every few tens of episodes. It happened consistently, and I noticed (from the timeout exception) that it happened when calling one of the methods that tick the world.

I tried to make the Carla server crash with code from the Carla team by playing with the example file manual_control.py. I spawned a new vehicle repeatedly, but no matter how many vehicles I spawned, the Carla server just would not crash.

I noticed that manual_control.py calls world.tick only once per iteration of the main loop, no matter how many actors it spawns or what else it does. So I removed almost all of the world.tick calls and ended up with code like this.

class CarlaEnv:
    ...

    def reset(self):
        # stop sensors
        # destroy sensors and vehicles
        self.client.apply_batch_sync(..., due_tick_cue=False)

        # spawn vehicles
        # spawn a sensor
        # spawn another sensor

        self.world.tick()

    def step(self, action):
        # apply action
        self.world.tick()

Now I have been running the simulation for hours, and it is still running without crashing.

In summary, this is what I think helps to avoid crashing the server.

  • Avoid calling tick when it isn't necessary.
  • Avoid applying settings repeatedly.
  • Avoid reloading the world. (I made this change at the very beginning because I usually got a timeout exception when loading a new world, whether or not I changed the map. I don't remember whether the server crashed, though, so this may only help to avoid timeouts.)
  • Avoid fetching something (such as a blueprint) repeatedly. If it can be reused, fetch it only once and reuse it. (I did this for the same reason I avoid reloading the world.)
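
The points above can be put together roughly as follows. This is only a sketch under assumptions: `EpisodeResetter` and its method names are hypothetical, the `world` argument is expected to behave like carla.World, and actors are destroyed one by one with `destroy()` rather than through `apply_batch_sync`. Transforms and the sensor callback are passed in by the caller.

```python
class EpisodeResetter:
    """Reuses one world, one settings change and cached blueprints across episodes.

    Hypothetical helper for illustration; `world` should behave like carla.World.
    """

    def __init__(self, world, vehicle_id, sensor_ids, fixed_delta_seconds=0.05):
        self.world = world
        self.vehicle = None
        self.sensors = []

        # Apply the synchronous-mode settings a single time, not once per episode.
        settings = world.get_settings()
        settings.synchronous_mode = True
        settings.fixed_delta_seconds = fixed_delta_seconds
        world.apply_settings(settings)

        # Look up the blueprints once and reuse them in every reset.
        library = world.get_blueprint_library()
        self.vehicle_bp = library.find(vehicle_id)
        self.sensor_bps = [library.find(sensor_id) for sensor_id in sensor_ids]

    def reset(self, spawn_transform, sensor_transform, on_data):
        # Stop and destroy the sensors before the vehicle they are attached to.
        for sensor in self.sensors:
            sensor.stop()
            sensor.destroy()
        self.sensors = []
        if self.vehicle is not None:
            self.vehicle.destroy()

        self.vehicle = self.world.spawn_actor(self.vehicle_bp, spawn_transform)
        for blueprint in self.sensor_bps:
            sensor = self.world.spawn_actor(
                blueprint, sensor_transform, attach_to=self.vehicle)
            sensor.listen(on_data)
            self.sensors.append(sensor)

        # One tick for the whole reset instead of one per spawn or destroy.
        self.world.tick()
        return self.vehicle
```

A step function would then apply the action and call `world.tick()` once, as in the snippet above.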

I hope my experience can help anyone who is having this problem.

witoong623 avatar Apr 29 '22 20:04 witoong623

@witoong623 Thanks, this is really good advice and matches our experience (we had to find similar workarounds as well). However, bugs like these still get in your way occasionally: often minor changes to a working code base make the simulator crash again for no apparent reason, which takes away so much time that could be better spent. I really do hope the Carla team soon dedicates some time to ironing out crashes like these. I think this would be highly appreciated and would well justify waiting a bit longer for other features.

hh0rva1h avatar Apr 30 '22 11:04 hh0rva1h

Hi, I'm looking into it. I have detected two problems so far.

  1. When you delete a sensor, its streams are not destroyed, so the system resources for those sensor streams remain in use. As a result, the system can at some point have many open file descriptors (sockets...).
  2. When you disconnect the client, the stream of the world sensor remains in use, and the next time you connect the client, a new connection is made to that stream. That means each connect/disconnect cycle adds a connection to the stream and consumes resources, because the previous one was not closed.

At some point the operating system has no more resources available and returns an error.

I will create a PR with the fix for point 1 as soon as I can, but I'm still checking point 2.
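
The descriptor growth described above can be watched from outside with a small check. The following is a sketch that assumes Linux with /proc available; `open_fd_count` and `own_fd_soft_limit` are hypothetical helper names. Counting the descriptors of another process, such as the CARLA server, requires its PID and sufficient permissions, and the count is approximate because listing the directory briefly opens a descriptor itself.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def own_fd_soft_limit():
    """Return the soft limit on open descriptors for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # For the current (client) process; pass the server PID to inspect the server.
    print("open descriptors:", open_fd_count(), "soft limit:", own_fd_soft_limit())
```

Printing the count once per episode, for both the client and the server PID, shows whether it keeps growing across spawn/destroy and connect/disconnect cycles.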

bernatx avatar May 03 '22 14:05 bernatx

@bernatx Great, I think this should handle https://github.com/carla-simulator/carla/issues/3994 as well. It would be great if you could check whether this also addresses https://github.com/carla-simulator/carla/issues/4861, https://github.com/carla-simulator/carla/issues/4935 and https://github.com/carla-simulator/carla/issues/3109?

hh0rva1h avatar May 04 '22 08:05 hh0rva1h

@bernatx Could you please link to PR as soon as it's ready?

hh0rva1h avatar Jul 05 '22 12:07 hh0rva1h

+1 need this fix ASAP, ty!

varunjammula avatar Jul 22 '22 18:07 varunjammula

Hi, we are releasing a new version next week, and this will include the fix for the 127 episodes problem. Here is the pending PR:

https://github.com/carla-simulator/carla/pull/5611

bernatx avatar Jul 22 '22 18:07 bernatx

Hi @bernatx, I pulled in the fix and built CARLA. There still seems to be a problem with TrafficManager: when a new client connection is started with a TrafficManager on a different port, the connection seems to hang and I observe a TimeOutException. Have you experienced any issues like this?

varunjammula avatar Oct 06 '22 07:10 varunjammula

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 02 '23 01:06 stale[bot]