TICC icon indicating copy to clipboard operation
TICC copied to clipboard

Future Plans: better throughput, more measurements/second

Open n8ur opened this issue 9 years ago • 15 comments

Without any optimizing (and using i64_t way more than is probably necessary, and Arduino that's really slow), the time inside the TICC measurement loop from "got data" to end of calculation is around 500us.

However, printing over USB takes 1 to 5ms. From some Googling I've learned that the following may be bottlenecks:

  1. The "Serial.print" function is very inefficient. Using "Serial.write" might be faster, but we have to send it the data character-by-character.
  2. USB packetizes data and sends variable length packets of up to 4K once per millisecond. So there's latency waiting for the next slot, and if the data crosses packets there's further delay.

I've also found or thought of a couple of possible solutions:

  1. Use Serial.write -- probably not a big win if you still have USB latency.
  2. Buffer the print requests so that we don't send on each measurement, but build up until we can send a burst. We do not need real-time data output; if a burst arrives every 10 seconds, I don't think that's a problem.
  3. Use one of the other UART ports on the Mega CPU for output. This would require an added TTL to RS-232 converter. It would get rid of USB latency, but may have other overhead or bottlenecks (I'm not clear whether only serial port 1 is hardware UART and the others may involve more bit banging).
  4. Use SPI to send data to another Arduino, which would buffer it to send out in large packets and could also do other cool things like drive an LCD front panel. I kind of like this idea as presumably the SPI would be able to move the data out pretty darn quickly, and we already have the SPI code in the program.

Thoughts?

n8ur avatar Sep 08 '16 23:09 n8ur

Hi John - let me try to answer some parts.

  1. When passed a string, Serial.print basically just calls Serial.write. There's not much difference performance-wise. Printing a float is where Serial.print is inefficient. Our code creates a string out of the uint64_t by using printf. I suspect that printf and Serial.print are roughly similarly inefficient.
  2. USB 2.0 packetizes 1024 bytes/frame. However I don't think the Arduino has any added processing as the buffer is passed to the 16U2, which queues it and completes the USB transfer independently from the 2560 processor. Any delay here is in when the characters hit the host computer (latency about which we don't care).

For the solutions:

  1. All 4 of the serial interfaces on the Mega2560 each have only a one-byte hardware buffer. The Arduino serial library puts the transmit characters into a 64-byte ring buffer then returns. The library uses interrupts to clock out the contents of that 64-byte buffer. If You try to put more than 64 bytes into the buffer, then the Serial.write (or print) call blocks, and all other processing stops until it drops to below 64 bytes in the buffer again.
  2. I don't think things improve by choosing a different serial output as the 16U2 offloads most or all of the USB work. Other ports have the drawback that the host computer would need a serial interface.

The key to speeding up the measurement cycle it to reduce the time the 2560 spends in the compute/format loop. As long as the transmit buffers stay below 64 bytes, I think the serial Tx is outside the critical path (i.e. handled by interrupts).

  1. We could use SPI to another Arduino. The question is: does that actually reduce the time in the critical processing loop? I'm not convinced that it would, but it might. If we just sent the 8-byte uint64_t to the 2nd Arduino via an SPI, and had that 2nd Arduino do all the formatting into ASCII, then it would remove some time from the critical-path loop of the 1st Arduino.

-- Tom, N5EG

Tom-McDermott avatar Sep 11 '16 03:09 Tom-McDermott

The Mega has 4 hardware UARTs that are all the same. (That's one of its big advantages over the smaller boards). The USB-to-serial conversion is handled on a separate chip that's wired up to UART0; the CPU doesn't see any of it. In fact if you want to ignore USB and use serial you can do it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with Tom about Serial.print, it's not likely to be your bottleneck.

In fact, I wouldn't be surprised at all if the most expensive thing is simply all of the 64-bit arithmetic (especially the divisions) in print_signed_picos_as_seconds. Not sure how much improvement is available there, but I think it's at least some.

Another thing worth trying is eliminating all of the digitalRead and digitalWrite in the measurement loop. Those functions have a bunch of overhead due to checks needed for beginner-friendliness (look up the port number and mask in a table of pin numbers, make sure it's a valid pin number, see if it's doing PWM, etc.). It's possible to bypass all that for performance, and may be worth it here.

arodland avatar Feb 25 '17 20:02 arodland

Thanks, Andrew.

You're confirming what I gathered after the bit of research I was able to do.

I have two ideas that separately or together might get around the 64 bit bottleneck, but haven't had time to fully think through either of them:

  1. Don't do math or print results at 1 picosecond resolution, but at 10 ps. That reduces scaling at the small end by an order of magnitude, and shouldn't cost anything given 60-70 ps of noise. Then maybe int32 is big enough for practical purposes.

  2. Process and print the integer and fractional part of the result separately, each as a native 32 bit, e.g.,

Serial.print(integer_part);Serial.print(".");Serial.println(fractional.part)

That requires converting PICstop (count of 100us ticks) into two variables: integer seconds, and n*0.0001, then subtracting the time-of-flight variable and deal with any rollovers that result. Can all that be done with less overhead than the current approach? I have no idea.

John

On 02/25/2017 03:22 PM, Andrew Rodland wrote:

The Mega has 4 hardware UARTs that are all the same. (That's one of its big advantages over the smaller boards). The USB-to-serial conversion is handled on a separate chip that's wired up to UART0; the CPU doesn't see any of it. In fact if you want to ignore USB and use serial you can do it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with Tom about Serial.print, it's not likely to be your bottleneck.

In fact, I wouldn't be surprised at all if the most expensive thing is simply all of the 64-bit arithmetic (especially the divisions) in print_signed_picos_as_seconds. Not sure how much improvement is available there, but I think it's at least some.

Another thing worth trying is eliminating all of the digitalRead and digitalWrite in the measurement loop. Those functions have a bunch of overhead due to checks needed for beginner-friendliness (look up the port number and mask in a table of pin numbers, make sure it's a valid pin number, see if it's doing PWM, etc.). It's possible to bypass all that for performance, and may be worth it here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/TAPR/TICC/issues/22#issuecomment-282509781, or mute the thread https://github.com/notifications/unsubscribe-auth/ANonv3QQF4zwhPs8W9w7tCI3VkCvfVWYks5rgI2ZgaJpZM4J4jpm.

n8ur avatar Feb 26 '17 23:02 n8ur

Option 2 is very likely to help if done right. Division is pretty much the most expensive operation on a microcontroller, and it gets exponentially slower the further you go beyond the native word size (which is only 8 bits for AVR). If you can stop the values from getting so large in the first place, even at the cost of having more variables, and even at the cost of doing a division earlier on (but with smaller quantities) you can easily win.

arodland avatar Feb 27 '17 00:02 arodland

Attached is sample code for an int32-based printout routine. Not tested, but would appreciate review for dumb logic/math errors.

This moves a little more work into the ISR, but the benefit is there is no division required at all, at least to do the printing.

Note that we will still need to multiply and divide at 64 bits in the tdc read function in order to get sufficient resolution in the time-of-flight calculation.

The timestamp() function returns integer picoseconds in an int64_t because the time interval and period routines depend on timestamp math. But we're still left with the problem of printing those int64 results, so more thought is needed. But at least this is a start.

Is this on the right track?

John

On 02/26/2017 07:15 PM, Andrew Rodland wrote:

Option 2 is very likely to help if done right. Division is pretty much the most expensive operation on a microcontroller, and it gets exponentially slower the further you go beyond the native word size (which is only 8 bits for AVR). If you can stop the values from getting so large in the first place, even at the cost of having more variables, and even at the cost of doing a division earlier on (but with smaller quantities) you can easily win.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/TAPR/TICC/issues/22#issuecomment-282601115, or mute the thread https://github.com/notifications/unsubscribe-auth/ANonv6CZkBvjtFveh7951k86GH8O8afWks5rghWxgaJpZM4J4jpm.

n8ur avatar Feb 27 '17 15:02 n8ur

print.txt

n8ur avatar Feb 27 '17 15:02 n8ur

int32_t frac = PICstop_usec * 1000000; // mult by 1e6 to make picos

I think this number is off, if PICstop_usec is actually 100us ticks, you want to multiply by 1e8, not 1e6.

Either way, you need an int64_t at this point, or you will exceed the range of int32_t (+/- about 2e9). Picoseconds are tough, we can't even fit one second worth of them in a 32-bit int.

But as demonstrated by tof, the number of picos in a 100us tick does fit comfortably in an int32, and it's a nice decimal division (meaning you can be clever with print routines; instead of multiplying, adding, and printing, you can just print the two values back-to-back if you make sure to zero-pad the second one). So I would suggest dividing things up along those lines, as a struct timestamp { int32_t sec; uint16_t ticks; uint32_t ps; } (80 bits, with the decimal point located convenently between two quantities, time range over a hundred years). Adding and subtracting them would be done through (relatively cheap) library functions that add/subtract corresponding parts and deal with carry/overflow.

I admit it seems a bit complicated, but it could be done up pretty tidily, and the next simpler option that I see (that of struct { int32_t ticks; uint32_t ps; }) only covers a range of about 5 days which would become inconvenient during long measurement runs.

I'm willing to make an attempt at doing it, if you think it's something you would want to maintain. It's the sort of thing I enjoy playing with anyhow.

arodland avatar Mar 01 '17 06:03 arodland

I'd love it if you'd take a whack at that, Andrew!

And you're right about 1e8, of course.

Thanks!

John

On 3/1/2017 1:58 AM, Andrew Rodland wrote:

|int32_t frac = PICstop_usec * 1000000; // mult by 1e6 to make picos |

I think this number is off, if PICstop_usec is actually 100us ticks, you want to multiply by 1e8, not 1e6.

Either way, you need an int64_t at this point, or you will exceed the range of int32_t (+/- about 2e9). Picoseconds are tough, we can't even fit one second worth of them in a 32-bit int.

But as demonstrated by tof, the number of picos in a 100us tick /does/ fit comfortably in an int32, and it's a nice decimal division (meaning you can be clever with print routines; instead of multiplying, adding, and printing, you can just print the two values back-to-back if you make sure to zero-pad the second one). So I would suggest dividing things up along those lines, as a |struct timestamp { int32_t sec; uint16_t ticks; uint32_t ps; }| (80 bits, with the decimal point located convenently between two quantities, time range over a hundred years). Adding and subtracting them would be done through (relatively cheap) library functions that add/subtract corresponding parts and deal with carry/overflow.

I admit it seems a bit complicated, but it could be done up pretty tidily, and the next simpler option that I see (that of |struct { int32_t ticks; uint32_t ps; }|) only covers a range of about 5 days which would become inconvenient during long measurement runs.

I'm willing to make an attempt at doing it, if you think it's something you would want to maintain. It's the sort of thing I enjoy playing with anyhow.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/TAPR/TICC/issues/22#issuecomment-283261054, or mute the thread https://github.com/notifications/unsubscribe-auth/ANonv4blZ4USSlikjQsokSbOZoxpUT-vks5rhRcTgaJpZM4J4jpm.

n8ur avatar Mar 01 '17 13:03 n8ur

It's been mentioned that baud rate imposes a basic limit on throughput. I've been using 115200 as that is the maximum speed of the Arduino IDE terminal window, but the hardware apparently supports up to 2mb. I just did a test of the current code with the "DETAIL_TIMING" flag set in TICC.ino, and found that 230400 worked just fine and 576000 worked with some garbles. But neither one resulted in reduced execution time. So I think that at present we are compute, and not wire, bound.

n8ur avatar Mar 02 '17 00:03 n8ur

Andrew, not sure I replied to this, but I did do macros for LED control. Don't know if they make much difference over digitalWrite(), but I'll take what I can get.

John

On 02/25/2017 03:22 PM, Andrew Rodland wrote:

The Mega has 4 hardware UARTs that are all the same. (That's one of its big advantages over the smaller boards). The USB-to-serial conversion is handled on a separate chip that's wired up to UART0; the CPU doesn't see any of it. In fact if you want to ignore USB and use serial you can do it right now by unplugging the USB, applying power to Gnd and 5V (or Gnd and Vin), and hooking up TTL serial to pins 0 and 1. And I agree with Tom about Serial.print, it's not likely to be your bottleneck.

In fact, I wouldn't be surprised at all if the most expensive thing is simply all of the 64-bit arithmetic (especially the divisions) in print_signed_picos_as_seconds. Not sure how much improvement is available there, but I think it's at least some.

Another thing worth trying is eliminating all of the digitalRead and digitalWrite in the measurement loop. Those functions have a bunch of overhead due to checks needed for beginner-friendliness (look up the port number and mask in a table of pin numbers, make sure it's a valid pin number, see if it's doing PWM, etc.). It's possible to bypass all that for performance, and may be worth it here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/TAPR/TICC/issues/22#issuecomment-282509781, or mute the thread https://github.com/notifications/unsubscribe-auth/ANonv3QQF4zwhPs8W9w7tCI3VkCvfVWYks5rgI2ZgaJpZM4J4jpm.

n8ur avatar Mar 03 '17 22:03 n8ur

Status update: I found some time to work on this over the weekend. I refactored timestamps the way I suggested in my earlier comment, got everything working, and ended up with execution times that were almost exactly the same as before. I'll investigate more to try to figure out where the time is really being spent.

arodland avatar Mar 07 '17 18:03 arodland

That's interesting...

I just made a change (version 20170308.1) that might help with profiling. The new version I just pushed to main adds a "Null Output" mode that goes through all the motions but doesn't format or output the result. Setting that mode, plus uncommenting DETAIL_TIMING, should provide better ability to see the speed of the loop without serial output overhead.

A very quick test indicates that with a single channel active, time to get and calculate a result is about 650ms, but the total time through the loop is about 1100ms -- almost half the time is spent not doing useful work. I haven't had a chance yet to look for where that time is being spent.

On 03/07/2017 01:07 PM, Andrew Rodland wrote:

Status update: I found some time to work on this over the weekend. I refactored timestamps the way I suggested in my earlier comment, got everything working, and ended up with execution times that were almost exactly the same as before. I'll investigate more to try to figure out where the time is really being spent.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/TAPR/TICC/issues/22#issuecomment-284806462, or mute the thread https://github.com/notifications/unsubscribe-auth/ANonv5kTQS4cmnPgi87fPfBedGWpnIFkks5rjZz1gaJpZM4J4jpm.

n8ur avatar Mar 09 '17 01:03 n8ur

Serial.print() statements in the DETAIL_TIMING debug code were swamping the actual execution time. I just pushed 20170309.1 which just prints total execution time (micros) with nothing else and that cut down (but did not completely eliminate) the overhead. That in conjunction with the "NULL" measurement mode should allow better profiling -- toggling between NULL and TIMESTAMP mode should show the result formatting/output overhead.

n8ur avatar Mar 09 '17 18:03 n8ur

That's great! I'll rebase my work on top of this and see what comes out.

arodland avatar Mar 10 '17 04:03 arodland

I suppose this is dead, it's been over a year. The changes I made worked, but contrary to expectation, made absolutely no difference to the performance of the code, and I couldn't get to the bottom of why. I'll try to find my code and push it up in case anyone wants to look into it in the future.

arodland avatar Apr 18 '18 21:04 arodland