fleet-telemetry

ZMQ server bug?

Open rawmean opened this issue 11 months ago • 5 comments

Roughly once every 30 minutes, fleet-telemetry crashes with one of the following errors:

Bad address (src/tcp.cpp:253)
double free or corruption (top)
free(): double free detected in tcache 2

CPU usage and memory usage are both about 15% on my machine. This problem started when I added more vehicles.

rawmean avatar Jan 12 '25 02:01 rawmean

@jordan-bonecutter would you have any guidance on how to debug this issue?

agbpatro avatar Jan 16 '25 19:01 agbpatro

It seems to be a race condition that happens only at scale (i.e., when the number of vehicles streaming data goes above a certain level). ZMQ was working fine until I ramped up the number of vehicles.

The error is not related to the ZMQ client, because the crash still happens when the client is stopped and no one is listening on the ZMQ port.

I also monitored the memory consumption and I don't think it's a memory leak problem either because memory consumption is stable.

I finally gave up and switched to PubSub and it's working fine.

rawmean avatar Jan 16 '25 20:01 rawmean

Memory bugs can be weird, so I wouldn't rule out ZMQ per se. I checked tcp.cpp in the source at the given line and it isn't terribly interesting:

#if !defined(TARGET_OS_IPHONE) || !TARGET_OS_IPHONE
        errno_assert (errno != EACCES && errno != EBADF && errno != EDESTADDRREQ
                      && errno != EFAULT && errno != EISCONN
                      && errno != EMSGSIZE && errno != ENOMEM
                      && errno != ENOTSOCK && errno != EOPNOTSUPP);

which doesn't seem to be doing any freeing or deleting to me. I will spend some time on this over the weekend, but I have not run into this myself. @rawmean are you using the Dockerfile to build? I wonder if you're using a different version of ZMQ where this line is more interesting.
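
For context on how the first error message could map to that assertion: the log line "Bad address (src/tcp.cpp:253)" has the shape "error text (file:line)", and "Bad address" is the text that strerror returns for EFAULT on Linux with glibc. EFAULT is one of the errno values the quoted check excludes. The sketch below only reproduces that message format. The helper name format_errno_failure is hypothetical and is not part of libzmq, and it is an assumption that line 253 in the reporter's build is this same assertion.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper (NOT libzmq code): builds a message shaped like the
// one in the crash log, "<strerror text> (<file>:<line>)".
std::string format_errno_failure(int err, const char *file, int line) {
    return std::string(std::strerror(err)) + " (" + file + ":" +
           std::to_string(line) + ")";
}
```

Calling format_errno_failure(EFAULT, "src/tcp.cpp", 253) on a glibc system should give the same text as the first error in the report. If that reading is right, the send call was handed an invalid user-space address, which would fit a buffer that had already been freed or corrupted somewhere else rather than a bug on this line.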

jordan-bonecutter avatar Jan 16 '25 21:01 jordan-bonecutter

I didn't use Docker. How many vehicles did you test it with? To test the crash problem I think you need to test it with at least 2000 vehicles.

The code that you shared: is that from tcp.cpp? It's unlikely that tcp.cpp has a bug, because it's used extensively everywhere. I'm surprised that it refers to an iPhone target.

rawmean avatar Jan 16 '25 21:01 rawmean

Yeah, the bug won't be in this line, but we'll be able to see what's being freed, and that could be super useful. I no longer work for the company that was using the API, but we had roughly 100 vehicles.

jordan-bonecutter avatar Jan 16 '25 21:01 jordan-bonecutter