Latency - a real issue in the wild
I wonder how (or whether) badger is optimized for latency (scan rate, debouncing, ...), along the lines of https://danluu.com/keyboard-latency/ ?
For less technical information see also https://bdickason.com/posts/speed-is-the-killer-feature/ .
This will of course depend tremendously on the kind of micro-controller you use, on whether you use I2C port expanders like I do, and possibly also on the USB implementation (I essentially just rewrote the C version in Nim). That being said, the code has extremely little overhead, and it currently doesn't implement debouncing (I haven't had any issues with this on my board so I didn't implement it, but YMMV). I haven't done any latency measurements, but I'm guessing it would be about as fast as the micro-controller allows it to be.
I guess I could set up a small rig to measure this, though: basically a small script on another micro-controller sending a short "keypress" pulse, wired up to the pins of a switch, with a logic analyzer connected to the USB of the Teensy and to the pulsing controller. That should allow me to measure latency pretty easily.
> I guess I could set up a small rig to measure this, though: basically a small script on another micro-controller sending a short "keypress" pulse, wired up to the pins of a switch, with a logic analyzer connected to the USB of the Teensy and to the pulsing controller. That should allow me to measure latency pretty easily.
Wow, this would of course be awesome - especially to know what's the latency range with i2c expanders (while considering low power mode etc.). But take your time.
Even without this measurement setup I sense a lot of potential in this firmware, as it's not yet weighed down by decisions such as how to do debouncing (debounce by waiting before sending, versus debounce by sending first and then waiting before sending the release if the key is in the other state or "not yet").
Thanks!
Did something even easier: I just hooked a logic analyzer up to the I2C and the USB data pins and looked at what's going on. An entire scan of the matrix takes about 40.8ms, and most of this is of course I2C communication. Some of this could be trimmed away, though: currently, to set a pin high it must first read a register, then write back that register with one bit high. It could have some logic to avoid this. That's not too hard, but it was a bit outside the scope when I first wrote the keyboard (remember I streamed the whole thing, so I had to keep parts of it brief). The scan time would be lower if you only used one port expander (e.g. with something like the ATtiny85 and a port expander to have just enough ports for an entire board), and even lower if everything was connected directly to the Teensy.
The USB seems to send some kind of update every 1ms, and the keys which are detected are sent in the packet right after the scan, so at most a 1ms delay there. The matrix scans are also done back to back (I had to add a small delay in the code to even see where one ended and the next started), so the gaps between them are negligible.
So all in all, if you press the first key of the matrix right after it has been scanned (so you have to wait for that scan and the next to complete) you are looking at about 81ms latency. The best case scenario is hitting the last key right before it is scanned, at which point you could easily get around 1ms latency.
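The arithmetic behind those two numbers can be written down as a tiny model. This is only a sketch under the assumptions stated in the comments (the function name and the simplified model are mine, not part of the firmware): worst case is two full scans plus one USB poll interval, best case is just the poll interval.

```python
def latency_bounds_ms(scan_ms, usb_poll_ms=1.0):
    """Return (best_case, worst_case) key-to-host latency in milliseconds.

    Simplified model (an assumption, not measured firmware behaviour):
    - best case: the key is pressed just before it is sampled at the end
      of a scan, so it only waits for the next USB poll;
    - worst case: the key was just missed, so it waits for the rest of
      this scan and for the whole following scan, then for the USB poll.
    """
    best_case = usb_poll_ms
    worst_case = 2 * scan_ms + usb_poll_ms
    return best_case, worst_case


best, worst = latency_bounds_ms(40.8)  # 40.8 ms is the measured scan time
print(f"best ~{best:.1f} ms, worst ~{worst:.1f} ms")
```

With the measured 40.8ms scan this gives a worst case a little over 80ms, which is in line with the "about 81ms" estimate (the exact figure depends on whether you count the final USB poll).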
> Did something even easier: I just hooked a logic analyzer up to the I2C and the USB data pins and looked at what's going on. An entire scan of the matrix takes about 40.8ms, and most of this is of course I2C communication. Some of this could be trimmed away, though: currently, to set a pin high it must first read a register, then write back that register with one bit high. It could have some logic to avoid this. That's not too hard, but it was a bit outside the scope when I first wrote the keyboard (remember I streamed the whole thing, so I had to keep parts of it brief). The scan time would be lower if you only used one port expander (e.g. with something like the ATtiny85 and a port expander to have just enough ports for an entire board), and even lower if everything was connected directly to the Teensy.
My setup will need 40 (or 42) keys, so it's 6x7. I didn't check whether the Teensy 2.0 has this many GPIO pins free for buttons (i.e. not connected to anything), but it sounds like too high a number from my experience, so I believe I'll need one port expander.
> The USB seems to send some kind of update every 1ms, and the keys which are detected are sent in the packet right after the scan, so at most a 1ms delay there. The matrix scans are also done back to back (I had to add a small delay in the code to even see where one ended and the next started), so the gaps between them are negligible.
This sounds a lot like polling from the bus master (your PC), which usually happens every 8, 4 or 2 ms (or even more often if the USB chip in your PC is new enough to provide such speeds and is not switched to a low-power mode). This is something we probably can't influence much, or at all.
On the other hand this gives me hope: if Badger sleeps for most of that 1ms (which I'd consider the "practical minimum seen in the wild" despite USB 3.1 chips supporting polling rates up to 8000 Hz :open_mouth:), always starting the sleep from the moment of the last received poll, then the USB connection shouldn't drop and we could save a lot of power (of course only when not scanning the matrix at that time :wink:).
> So all in all, if you press the first key of the matrix right after it has been scanned (so you have to wait for that scan and the next to complete) you are looking at about 81ms latency. The best case scenario is hitting the last key right before it is scanned, at which point you could easily get around 1ms latency.
81ms sounds like a bit too much to me. But as you wrote, there is quite some potential for improvement (which, totally understandably, wasn't a priority so far). Which is awesome!
Before we dive into code, do you think we could get as low as 20 ms or lower with that "double port expander setup" of yours? Or: what's the minimum latency reachable with the low-hanging fruit, and what's the minimum potentially achievable latency with your setup?
> My setup will need 40 (or 42) keys, so it's 6x7. I didn't check whether the Teensy 2.0 has this many GPIO pins free for buttons (i.e. not connected to anything), but it sounds like too high a number from my experience, so I believe I'll need one port expander.
This should work just fine with a normal matrix setup. I had one of my halves (5x7) hooked up directly to the Teensy 2.0 and it still had pins left over. With a matrix setup you use W + H pins (so 5 + 7 for me and 6 + 7 for you). 6 + 7 is just 13 pins, and the Teensy 2.0 has 23 or something like that.
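As a quick check of the pin arithmetic, here is a small sketch. The helper names are made up for illustration, and the available pin count is simply the "23 or something like that" figure from above, not a value checked against a datasheet.

```python
def matrix_pins(rows, cols):
    """Pins needed when keys are wired as a rows x cols matrix."""
    return rows + cols


def direct_pins(rows, cols):
    """Pins needed if every key got its own pin instead."""
    return rows * cols


AVAILABLE_PINS = 23  # figure quoted in the thread, treat as approximate

for name, (rows, cols) in {"5x7 half": (5, 7), "6x7 board": (6, 7)}.items():
    needed = matrix_pins(rows, cols)
    print(f"{name}: {needed} pins as a matrix "
          f"({direct_pins(rows, cols)} with one pin per key), "
          f"{AVAILABLE_PINS - needed} to spare")
```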
> This sounds a lot like polling from the bus master (your PC), which usually happens every 8, 4 or 2 ms (or even more often if the USB chip in your PC is new enough to provide such speeds and is not switched to a low-power mode). This is something we probably can't influence much, or at all.
> On the other hand this gives me hope: if Badger sleeps for most of that 1ms (which I'd consider the "practical minimum seen in the wild" despite USB 3.1 chips supporting polling rates up to 8000 Hz :open_mouth:), always starting the sleep from the moment of the last received poll, then the USB connection shouldn't drop and we could save a lot of power (of course only when not scanning the matrix at that time :wink:).
Might be, but it would be interesting to see if we can get the Teensy to negotiate a slower mode to save on power. This 1ms addition isn't a lot compared to the other time sinks, so it could be a good way to lower power consumption. The problem is that the matrix is currently scanned at all times. I have been toying with the idea of setting all pins to high and then just reading one pin per matrix to see if any key is pressed, and only then scanning the full matrix. During this idle period you could probably clock the CPU down and do some light-sleep thing as well. I just worry about the time it takes to get back to full speed once it detects a keypress.
> 81ms sounds like a bit too much to me. But as you wrote, there is quite some potential for improvement (which, totally understandably, wasn't a priority so far). Which is awesome!
> Before we dive into code, do you think we could get as low as 20 ms or lower with that "double port expander setup" of yours? Or: what's the minimum latency reachable with the low-hanging fruit, and what's the minimum potentially achievable latency with your setup?
81ms is a bit long, I agree. I had a small poke, and since it seems that the USB sending is just done on the next clock cycle, I don't think it would be any problem to put it inside the polling loop. It would require a little bit of variable overhead but would approximately halve the latency. I also looked into the read/write thing I mentioned earlier: to set a pin to high right now it does a read of the port, then a write to the same port with the one pin I want to change set to high. This means that setting one pin to high is three I2C calls (write the address and register you want to read, read it, write the new value). The loop could easily track this, which would cut down communications by about 53%, so latency is again cut about in half. Implementing these two things would bring us down to ~20ms latency with two port expanders, and half that again if you only used one (I could easily have my entire board on one port expander, I just use two in order to be able to have two physically split halves). Another option would be to use SPI, which has higher clock speeds, instead of I2C. This would probably bring latency down further (not 100% sure though, I haven't looked at the overhead of SPI).
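To illustrate the "track the register in the loop" idea, here is a sketch that only counts bus transactions. `FakeExpander`, `ShadowedPort` and the register address are hypothetical stand-ins, not the firmware's actual library; the point is just that a local copy of the port turns three I2C calls per pin into one.

```python
class FakeExpander:
    """Stand-in for an I2C port expander that only counts bus transactions."""

    def __init__(self):
        self.registers = {}
        self.selected = None
        self.transactions = 0

    def select_register(self, reg):
        self.transactions += 1
        self.selected = reg

    def read_selected(self):
        self.transactions += 1
        return self.registers.get(self.selected, 0)

    def write_register(self, reg, value):
        self.transactions += 1
        self.registers[reg] = value & 0xFF


def set_pin_read_modify_write(dev, reg, pin):
    """Approach described above: select the register, read it, write it back."""
    dev.select_register(reg)
    value = dev.read_selected()
    dev.write_register(reg, value | (1 << pin))


class ShadowedPort:
    """Keeps a local copy of the port so setting a pin needs a single write."""

    def __init__(self, dev, reg, initial=0):
        self.dev = dev
        self.reg = reg
        self.shadow = initial

    def set_pin(self, pin):
        self.shadow |= 1 << pin
        self.dev.write_register(self.reg, self.shadow)


ROW_PORT = 0x12  # hypothetical register address, for illustration only

slow = FakeExpander()
for pin in range(5):
    set_pin_read_modify_write(slow, ROW_PORT, pin)

fast = FakeExpander()
port = ShadowedPort(fast, ROW_PORT)
for pin in range(5):
    port.set_pin(pin)

print(slow.transactions, fast.transactions)  # transactions needed by each approach
```

In this toy count the pin-setting traffic drops by two thirds. The ~53% quoted above refers to the whole scan, where the column reads presumably stay as they are.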
That being said, I'm now scanning the matrix in the most obvious way: set the row I want to listen to, and check every column in order. But it should be theoretically possible to do a binary search pattern. First set every row, and read the entire column port. I2C reads and writes are one byte per port, and mine has two ports with 8 pins per port, so doing that would be one write to set the entire row port, one write to set the entire column port, and then a single read to check if any button is pressed. Then divvy up the board to limit the number of reads/writes we actually need to do. This should be able to bring the latency down by a lot, I think.
> This should work just fine with a normal matrix setup. I had one of my halves (5x7) hooked up directly to the Teensy 2.0 and it still had pins left over. With a matrix setup you use W + H pins (so 5 + 7 for me and 6 + 7 for you). 6 + 7 is just 13 pins, and the Teensy 2.0 has 23 or something like that.
Interesting - that could make my setup even simpler. Thanks for the comparison!
> Might be, but it would be interesting to see if we can get the Teensy to negotiate a slower mode to save on power. This 1ms addition isn't a lot compared to the other time sinks, so it could be a good way to lower power consumption.
Actually 8 ms would be too much IMHO. On the other hand, anything quicker than 1 ms is nonsense. So if it's possible (I have no idea, but I have a slight feeling it's actually not), we should try e.g. 4 ms as the ideal candidate, and then try 2 ms and 1 ms if 4 ms doesn't work. Note that it needs to be a power of 2 according to the USB standard, if I'm not mistaken.
> The problem is that the matrix is currently scanned at all times. I have been toying with the idea of setting all pins to high and then just reading one pin per matrix to see if any key is pressed, and only then scanning the full matrix. During this idle period you could probably clock the CPU down and do some light-sleep thing as well. I just worry about the time it takes to get back to full speed once it detects a keypress.
This sounds plausible to me. But I never tried this (nor any similar trick), so I wonder how much power the high state would leak in the end. Definitely worth trying and measuring.
> 81ms is a bit long, I agree.
Yep: https://www.youtube.com/watch?v=vOvQCPLkPt4
> I had a small poke, and since it seems that the USB sending is just done on the next clock cycle, I don't think it would be any problem to put it inside the polling loop. It would require a little bit of variable overhead but would approximately halve the latency. I also looked into the read/write thing I mentioned earlier: to set a pin to high right now it does a read of the port, then a write to the same port with the one pin I want to change set to high. This means that setting one pin to high is three I2C calls (write the address and register you want to read, read it, write the new value). The loop could easily track this, which would cut down communications by about 53%, so latency is again cut about in half. Implementing these two things would bring us down to ~20ms latency with two port expanders, and half that again if you only used one (I could easily have my entire board on one port expander, I just use two in order to be able to have two physically split halves). Another option would be to use SPI, which has higher clock speeds, instead of I2C. This would probably bring latency down further (not 100% sure though, I haven't looked at the overhead of SPI).
That sounds awesome! I've started drawing the physical dimensions of the keyboard (it's actually quite complicated so this will take the most time IMHO). And then later I will be placing components - Teensy, buttons, etc.
> That being said, I'm now scanning the matrix in the most obvious way: set the row I want to listen to, and check every column in order. But it should be theoretically possible to do a binary search pattern. First set every row, and read the entire column port. I2C reads and writes are one byte per port, and mine has two ports with 8 pins per port, so doing that would be one write to set the entire row port, one write to set the entire column port, and then a single read to check if any button is pressed. Then divvy up the board to limit the number of reads/writes we actually need to do. This should be able to bring the latency down by a lot, I think.
Yes, that's actually a perfect strategy - didn't think of this before! Good old logarithm...
> This sounds plausible to me. But I never tried this (nor any similar trick), so I wonder how much power the high state would leak in the end. Definitely worth trying and measuring.
Well, it's not really high. The way it works is that the columns are set as input-pullup and the rows are set to output-low (i.e. off). When reading a column pin it will register as high if there is no route from the pullup to a row pin, but as 0 if there is a route. This shouldn't leak as much power as setting a high state and reading it, but I'm not sure what the actual difference comes down to.
> > 81ms is a bit long, I agree.
> Yep: https://www.youtube.com/watch?v=vOvQCPLkPt4
It is a worst case scenario though, and can only happen for certain keys; the last key to be scanned has a max latency of 41ms.
> That sounds awesome! I've started drawing the physical dimensions of the keyboard (it's actually quite complicated so this will take the most time IMHO). And then later I will be placing components - Teensy, buttons, etc.
Planning is half the process! I have gone through so many revisions in my head for this keyboard, finally someone at work spurred me on to just get something working, so the keyboard I have now isn't exactly what I wanted to make, but at least I have one.
> Yes, that's actually a perfect strategy - didn't think of this before! Good old logarithm...
While typing that out I came to another realization. I've got all my rows hooked up to one port on the expander, and all my columns hooked up to the other. The wiring is done in such a way that it shouldn't be a problem to leave all the columns as pullups and then simply read the entire column port in one read. This would save an incredible amount of communications, especially if combined with the tracking read. It would essentially cut it down to 5 writes and 5 reads per scan, compared to the 115 reads and 80 writes it has to do now. This alone would cut the latency down a whole lot, likely down into the sub 10ms range. And this is before applying the USB tracking read or the logarithmic scan! I'm currently rewriting the support libraries to be more generic so they can be used for more things, when that is done I'll try out this scheme and see what kind of latency I get.
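A sketch of that scheme, simulated in plain Python rather than talking to real hardware: the columns stay configured as pulled-up inputs, each row is selected with one write, and all its columns come back in one read. The simulation helper and names are assumptions for illustration, and it ignores ghosting, so it treats every key as if it had its own diode.

```python
N_ROWS, N_COLS = 5, 7  # one half of the board described above


def read_column_port(row_port, pressed):
    """Simulate one read of the column port.

    Columns are inputs with pull-ups, so each bit reads 1 unless a pressed
    key connects that column to a row that is currently driven low.
    """
    value = (1 << N_COLS) - 1
    for row, col in pressed:
        if not row_port & (1 << row):
            value &= ~(1 << col)
    return value


def scan(pressed):
    """Scan row by row, reading all columns of a row in a single port read."""
    found, writes, reads = set(), 0, 0
    all_rows_high = (1 << N_ROWS) - 1
    for row in range(N_ROWS):
        row_port = all_rows_high & ~(1 << row)  # drive only this row low
        writes += 1                             # one write to the row port
        columns = read_column_port(row_port, pressed)
        reads += 1                              # one read of the column port
        for col in range(N_COLS):
            if not columns & (1 << col):
                found.add((row, col))
    return found, writes, reads


keys, writes, reads = scan({(0, 3), (4, 6)})
print(sorted(keys), writes, reads)
```

The write and read counts come out at one each per row, i.e. the 5 writes and 5 reads per scan mentioned above (per port expander).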
Just checked with reading a full port and reading both port expanders now takes 8.6ms. So worst case scenario is now reduced to ~18ms.
Implemented the more efficient writes (manually for now) and a full read is now 2.8ms, so the worst case scenario is now ~7ms. Not too bad considering that it's over two port expanders. Now if I implement immediate sends it should halve that number to end at 3.5ms. The logarithmic scan might help some, but since you have to support holding down multiple keys you can easily end up with having to perform more read/writes than with the current system. But of course scanning for a single key would be faster.
A nice side effect of having to perform fewer reads is also that the code size is smaller, all these changes actually reduced code size by about 35%, down to a tiny 2458 bytes!
Wow! I haven't even finished the mechanical drawing yet (I'm about halfway through) and haven't started the PCB, and you're already implementing the cool stuff!
Now you can play fast games on your split keyboard with a slight edge over the others :wink:.
> Implemented the more efficient writes (manually for now) and a full read is now 2.8ms, so the worst case scenario is now ~7ms. Not too bad considering that it's over two port expanders. Now if I implement immediate sends it should halve that number to end at 3.5ms.
Worst case of 7ms is very good for me and definitely enough for my use case. 3.5ms would be "heaven" for me :wink:.
> The logarithmic scan might help some, but since you have to support holding down multiple keys you can easily end up with having to perform more read/writes than with the current system. But of course scanning for a single key would be faster.
Holding multiple keys at the same time is frequent. 75% single keys, 21% double keys, 3% triple keys, 1% other - that's the expected use case of my envisioned keyboard. But it still sounds worth exploring if one has a little bit of free time (I like sublinear/logarithmic algos :smile:).
> A nice side effect of having to perform fewer reads is also that the code size is smaller, all these changes actually reduced code size by about 35%, down to a tiny 2458 bytes!
That's, ehm, a whopping difference! I like how Nim can squeeze the generated code.
Btw. if you get to 3.5ms or lower, you should contract with some manufacturers (especially those having "gamer equipment" in their portfolios - which is pretty much everybody) to get some money out of this.
They'd certainly want lots of additional features (lights, settings without recompilation, support for the protocol their PC software uses, support for their wireless receiver, etc.) but if you had the time to do this for them, you could really get nicely paid for that :wink:.
Disclaimer: I'm in no way affiliated with any such manufacturer.
> Wow! I haven't even finished the mechanical drawing yet (I'm about halfway through) and haven't started the PCB, and you're already implementing the cool stuff!
This will be part of my upcoming FOSDEM talk about Nim on microcontrollers, so I spent pretty much my entire evening programming this stuff.
> Now you can play fast games on your split keyboard with a slight edge over the others :wink:.
I think my lack of skill in fast games would diminish any such edge :stuck_out_tongue:
> Worst case of 7ms is very good for me and definitely enough for my use case. 3.5ms would be "heaven" for me :wink:.
Most of this is from communicating with the port expanders by the way, if you hook everything up directly to the microcontroller you'd probably get much lower latencies.
> Holding multiple keys at the same time is frequent. 75% single keys, 21% double keys, 3% triple keys, 1% other - that's the expected use case of my envisioned keyboard. But it still sounds worth exploring if one has a little bit of free time (I like sublinear/logarithmic algos :smile:).
Of course, and the benefit will be bigger the more rows you have. On each port expander I need one read per selected row to get all the columns, and five writes to select the row. If I did a logarithmic scan (assuming a single key) I would need one write to select all rows, and one read, then one write (and a read) to select half of the rows (either 3 or 2), then another write (and a read) to select half again, and then potentially one more if it was in the three block. That's still 3-4 writes and the same amount of reads. So you save 1 or 2 reads with 1 key and 5 rows. And all of that is likely gone if you hold down two keys. If latency is your most important metric then keeping all the keys directly connected is probably your best bet (or maybe using SPI, haven't looked into that yet).
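The probe counts above can be checked with a small simulation. This is a sketch that loudly assumes exactly one pressed key (or none); the helper names and the way the rows are split are my own choices, and each "probe" stands for one write to select rows plus one read of the column port.

```python
N_ROWS, N_COLS = 5, 7
ALL_HIGH = (1 << N_COLS) - 1  # every column pulled up, nothing pressed


def read_column_port(selected_rows, key):
    """Simulated read: column bits are active-low for the selected rows."""
    value = ALL_HIGH
    if key is not None and key[0] in selected_rows:
        value &= ~(1 << key[1])
    return value


def binary_search_scan(key):
    """Locate a single pressed key; returns (position or None, probes)."""
    candidates = list(range(N_ROWS))
    probes = 1
    columns = read_column_port(candidates, key)  # select all rows at once
    if columns == ALL_HIGH:
        return None, probes                      # nothing pressed
    col = (columns ^ ALL_HIGH).bit_length() - 1  # valid for a single key only
    while len(candidates) > 1:
        half = candidates[: (len(candidates) + 1) // 2]
        probes += 1
        if read_column_port(half, key) != ALL_HIGH:
            candidates = half                    # key is in the probed half
        else:
            candidates = candidates[len(half):]  # key must be in the rest
    return (candidates[0], col), probes


for row in range(N_ROWS):
    position, probes = binary_search_scan((row, 2))
    print(f"key in row {row}: found {position} after {probes} probes")
```

With 5 rows this needs 3 or 4 probes depending on where the key sits, against 5 for the plain row-by-row scan, which matches the "save 1 or 2 reads" estimate. Supporting several held keys would need extra probes on both halves, which is where the saving disappears.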
> That's, ehm, a whopping difference! I like how Nim can squeeze the generated code.
Yup, that's really neat. For comparison your typical "hello world" microcontroller example, blinking the built-in LED, implemented with Arduino on the Teensy takes 2422 bytes and uses 24 bytes of global memory (this firmware uses 14). So if I only had one layer in my layout my entire firmware would probably be smaller than the Arduino blink LED code. And for completeness the LED blink example using the same libraries I use for this firmware takes 248 bytes and 0 bytes of global memory.
> Btw. if you get to 3.5ms or lower, you should contract with some manufacturers (especially those having "gamer equipment" in their portfolios - which is pretty much everybody) to get some money out of this.
> They'd certainly want lots of additional features (lights, settings without recompilation, support for the protocol their PC software uses, support for their wireless receiver, etc.) but if you had the time to do this for them, you could really get nicely paid for that :wink:.
Adding all those features might start eating into the latency though :thinking:. But getting paid for working with microcontrollers would be pretty cool.
Haha, I had actually left in some unnecessary LED blinking and my debug delays (to be able to see the individual blocks of the I2C communications). With those removed, the firmware size is now 2392 bytes with both layers. So already it's smaller than the Arduino blinking code.
> This will be part of my upcoming FOSDEM talk about Nim on microcontrollers, so I spent pretty much my entire evening programming this stuff.
Will need to watch your talk from the recording (I'll certainly be busy when you get online).
> If latency is your most important metric then keeping all the keys directly connected is probably your best bet (or maybe using SPI, haven't looked into that yet).
I'm not sure, but I would guess SPI is more power hungry (though it could be faster, so maybe the peak consumption would be more than outweighed by the speed - IDK).
> Yup, that's really neat. For comparison your typical "hello world" microcontroller example, blinking the built-in LED, implemented with Arduino on the Teensy takes 2422 bytes and uses 24 bytes of global memory (this firmware uses 14). So if I only had one layer in my layout my entire firmware would probably be smaller than the Arduino blink LED code. And for completeness the LED blink example using the same libraries I use for this firmware takes 248 bytes and 0 bytes of global memory.
:smile:
> Adding all those features might start eating into the latency though :thinking:.
Definitely - but maybe not that much if done cleverly and not excessively... (blinking LEDs etc. are not my thing, so this is pure guesswork :wink:).
> But getting paid for working with microcontrollers would be pretty cool.
If I stumble upon such an opportunity, I'll let you know. Generally I think it's more about automotive nowadays, but YMMV.
Oh, now I noticed you're one of the main organizers of Nim-related stuff at FOSDEM. I really need to find some time that weekend or soon after. Good luck with that!