DynamixelSDK icon indicating copy to clipboard operation
DynamixelSDK copied to clipboard

C library, many flaws, unusable

Open bbulkow opened this issue 4 years ago • 9 comments

Hi, I attempted to use the C library on Linux. I did not use the C++ library, and I would first warn anyone to use the C++ library as apparently it is more used.

In the C library, I found a wide range of bugs that made it unusable, and I lost several days diagnosing and repairing. With the following fixes (I hope I remember them all), the library is now working.

  1. Improper handling of negative return code from read(). This was the worst problem. The Linux code needs the following:
int readPortLinux(int port_num, uint8_t *packet, int length)
{
  ssize_t sz = read(portData[port_num].socket_fd, packet, length);
  if (sz < 0) sz = 0;
  return(sz);
}

because the fd is opened non-blocking, and needs to be, in order to provide timeouts without using select(). Thus, if there is no data available, read() returns -1 with errno set to EAGAIN (11). This function is defined as signed, but the value is added to an unsigned value here (the function all reads use internally):

 uint16_t rx_length = 0;
....
  while (True)
  {
    rx_length += readPort(port_num, &packetData[port_num].rx_packet[rx_length], wait_length - rx_length);
    if (rx_length >= wait_length)

This results in the rx_length being very large (64k), so the subsequent code attempts to parse the buffer, which is either uninitialized, or loaded with the prior read. If one read is done correctly, subsequent reads return the same value over and over. Imagine my surprise when attempting to read the position of my servo and I found not only were the returned values out of bounds, they never changed, until - 1 out of 100 times - I happened to hit the timing right and actually read data from the serial port.

Also, the request-response order is now lost. If that is a TxRx interaction, or a status packet coming back, the next read may well receive it (the timing hole is small because there is a tcflush() but I saw it happen). For this reason - and since the code doesn't use the requestID for validation - one gets responses to the wrong requests.

As the readPort function has no way to return an error (there are other errors besides EAGAIN), the only sensible modification other than restructuring the code is to consider that all errors are a 0 length. In reality, the code should be restructured to surface other errors.

  1. Globals are not set to 0 in C. There are two underlying globals, *packetData notably, which are not initialized. In the Arduino language, variables are initialized to 0, but not in C. Instead of setting the variables the first time used, the code is attempting to realloc() (we'll get to that). The simplest fix is to set the variables to 0 at a correct time. Better code for packetHandler is simply:
  packetData = (PacketData*) malloc(g_used_port_num * sizeof(PacketData) );

because there is no need to first malloc() to a size 1 buffer then realloc to a larger buffer, and you shouldn't test the variable packetData because it wasn't initialized to 0. The code will also fail (leaking memory at the very least) if called twice, so this kind of protection against a second call is not suitable.

  1. packetHandler() must come after setting up all ports. While this is true, it seems, in many of your SDK libraries, the C example programs don't obey this rule. You need to change the examples to do the required thing, and you need to document this requirement. The fact that packetHandler is defined as 'void' means you have no way of returning and error and showing the programmer they must look at the documentation. The simplest fix is to document & change the examples, a better version would be to also enforce the restriction in the code and have packetHandler() return an error.

  2. Extraordinary use of realloc(). I can almost see what the programmer is thinking: realloc() must be the most space efficient use, and is handy because the length is maintained internally, I'm going to use it on every function call, multiple times! However, attention to two details is warranted: first, if you call realloc() with a 0 pointer, it behaves like malloc() - this is defined in the spec - which means that code which mallocs a byte because a realloc() will come along and set the buffer to the right size is unnecessary, for example. Please use the specified behavior of realloc() to simplify the code. Second, realloc() is slow. I found in my case, every function call I made (set a goal, read the current position) meant the transmit buffer was being resized between size 7 and size 8 constantly. For a device which can respond in less than 10us (30us with actual bandwidth), a single call to realloc() can bust your budget, and is unnecessary.

I would propose allocating all the buffers up front - they shouldn't even be pointers and malloc()'d, they can directly in the structure. The 4 buffers in question all have a maximum size of 1k (less depending on the servo and use pattern), and I stipulate that anyone who buys servos for this price can pay 4K of DRAM per serial port, and will gladly do so for the increase in predictability and reduced potential failure cases. Only on the smallest of arduinos is this a problem, but this code isn't for the Arduino language (see 1. above).

The reality, even on a small-memory system like an Arduino Uno, is one must be prepared to receive large packets. The Arudino memory system is even more primitive than Linux - most Linux distros have an 8 or 16 byte lower limit, which means resizing between 7 and 8 is only the cost of a system call - but in arduino, this actually creates holes in the heap ( see the advice of using Arduino's String object, which commonly creates real-world problems for programmers because it reallocs every byte ). The only better pattern for arduino would be a re-write that doesn't have 4 buffers, but has one. There is actually no good reason to have a separate packet buffer and data buffer, and no good reason to have a separate tx and rx buffer. There is a small amount of state (operation, id) to be kept between the request and response.

The final argument is that so much of this code is unable to return a sensible error if there is an allocation failure, it's best to make sure an allocation failure never happens. Up-front allocation is a good pattern for embedded code of this type, and when i did that (it took a full hour of modifying this code :-)), the code ran stable.

  1. Memory overwrite or duplicate free lurking somewhere. Reading through the code, I found a place where malloc() was used instead of realloc(), but it looked like it would only leak. I also received a large number of crashes due to "stomped" memory, and yes I am using only a single thread and aware the API is not threadsafe (see below). When I changed 4) to use static 4k buffers for performance, and also solved 1. above which could easily result in reading and writing beyond the length of the buffer, and solving the global initialization problem 2) these crashes went away. As many of the realloc() lengths are constants, it would be very easy to make a single mistake somewhere.

  2. Illusory multi-threading protection. While the code is non-thread-safe, and should be documented as such, there is frequent use of the following metaphore:

  if (g_is_using[port_num])
  {
    packetData[port_num].communication_result = COMM_PORT_BUSY;
    return;
  }
  g_is_using[port_num] = True;

This code is obviously not thread safe either, and code which provides the illusion of thread safety is only a trap for the unwary. It is better to remove the illusion, to document the code as not safe, or to make the code thread safe with any of the common standard mutex / critical section / atomic means of providing actual thread safety, than to have this type of code.

This is all I remember finding, and thought I should be a good citizen and report. Good luck to you!

bbulkow avatar Dec 30 '20 21:12 bbulkow

One addendum. Using 'select' for an embedded system, even on one file descriptor, to provide a timeout is not unreasonable. While it might seem that select() is a heavyweight operation compared to a read(), it has always been common to use single-fd selects, so the code is well tuned. I don't recommend doing that in this case, because we expect the servo to always respond in a few microseconds, but it is the other valid professional approach.

bbulkow avatar Dec 30 '20 22:12 bbulkow

Oh, I forgot another one. The API call 'action' is missing from the headers. It's a bit unclear what 'action' does - I sent a note to a forum on this. In any case, it does exist, 'action2' is defined and seems to work, but 'action1' and 'action' don't exist at all - and nothing's in any of the headers.

bbulkow avatar Dec 31 '20 01:12 bbulkow

@bbulkow Thank you very much for your thorough feedback on our SDK issue and sincerely apologize for your inconvenience using it in your project. If I may ask, could you give us more information about your development environment and hardware configuration which would be very helpful for us to review our code?

Regarding with the action, it is diverged into action1 and action2 based on the Protocol 1.0 / 2.0 just like other methods such as reboot2, readTxRx2, etc.

Action was widely used when all connected DYNAMIXEL had to be start operating at the exact same time. In order to use this, the Reg Write instruction must be used to save certain instruction in each DYNAMIXEL then Action triggers them to launch. This is described in the Protocol section of eManual below. https://emanual.robotis.com/docs/en/dxl/protocol2/#action

Happy New Year!

ROBOTIS-Will avatar Jan 04 '21 07:01 ROBOTIS-Will

@ROBOTIS-Will My development environment is Ubuntu 18.04 with GCC 10.1. I consider this the most vanilla of possible Linux environments, which is why I use it.

Thanks for your note on Action. It's now clearer that it must be used with RegWrite. Let me, then, point out that RegWrite is a part of the C API in packet_handler.h, but action() has been removed. RegWrite would seem difficult to use without Action. An example that uses only the exposed documented API calls might be suitable.

bbulkow avatar Jan 05 '21 00:01 bbulkow

@bbulkow Thank you for the information. I'll put a reminder to add the missing Action API in the header file. Thank you for pointing that out :)

ROBOTIS-Will avatar Jan 05 '21 07:01 ROBOTIS-Will

Has your investigation proven or disproven my theory that the C API is substantially broken on linux, with uninitialized variable crashes, the inability to read data reliably, and memory leaks? Or is it just me?

bbulkow avatar Jan 06 '21 19:01 bbulkow

@bbulkow Thanks for the friendly reminder. The review on this issue will take some time so please take your time and I'll get back to you when we finish the investigation. Thank you.

ROBOTIS-Will avatar Jan 07 '21 00:01 ROBOTIS-Will

@ROBOTIS-Will I have published a version of the C API for ESP32 in order to move forward with a project I'm working on. It has many of the changes made fixing the flaws I found. While it's not for linux - it's for the Espressif branch of FreeRTOS - perhaps you can take note of it.

In my measurements, about 30% of my latency in the control loop is still caused by allocating and freeing - this is in 'sync read' and 'sync write'. Depending on whether that 30% matters I'll improve the library further.

In any case, feel free to take a look at the repo, it has a variety of thoughts in the readme. https://github.com/bbulkow/Dynamixel-idf

bbulkow avatar Jan 11 '21 21:01 bbulkow

@bbulkow Thank you for sharing your repository. This will be a good example of issues when implementing DYNAMIXEL SDK on an embedded board.

ROBOTIS-Will avatar Jan 12 '21 02:01 ROBOTIS-Will