SPI Driver: Long CSB-SCK delay and SCK-CSB delay leads to poor performance of SPI

Open Jephinus opened this issue 3 years ago • 27 comments

Hi, I'm using the spidev interface with my code. I noticed that when I set the SPI max speed to a higher SCK frequency, overall performance does not improve much. So I went on to investigate by probing the four pins (CSB, SCK, SDI and SDO) with an oscilloscope.

Surprisingly, I noticed a measurable delay between CS assertion and the first SPI clock, and another delay between the last SPI clock and CS deassertion. The large delay between CS going LOW and the first SPI clock makes my overall SPI transfer very slow. Are there any modifications I can make to address this?

I really need help optimizing SPI performance and would appreciate any help or guidance. Thanks in advance.

Jephinus avatar Sep 08 '21 08:09 Jephinus

I think @l1k has previously done a lot of work on SPI optimisation - I haven't - so may be able to offer some suggestions if he has the time (not being employed by Raspberry Pi), but in general I would expect a kernel device driver to have higher performance than a user-space spidev client app.

pelwell avatar Sep 08 '21 08:09 pelwell

The struct spi_ioc_transfer passed in from user space contains delay_usecs and word_delay_usecs attributes. First thing to check is whether these are non-zero. E.g. if your code is written in C and accesses the /dev/spidevB.C character device directly, be sure to zero these fields before passing the struct to the kernel.

l1k avatar Sep 08 '21 09:09 l1k

Thanks, Lukas.

pelwell avatar Sep 08 '21 10:09 pelwell

Hi,

Thanks for replying to my issue.

@l1k, I believe I've zeroed out the delay_usecs and word_delay_usecs attributes by setting spi_xfer.delay_usecs = 0; as below:

int spi_transfer(spi_t *spi, const uint8_t *txbuf, uint8_t *rxbuf, size_t len)
{
    struct spi_ioc_transfer spi_xfer;

    /* Prepare SPI transfer structure */
    memset(&spi_xfer, 0, sizeof(struct spi_ioc_transfer));
    spi_xfer.tx_buf = (uintptr_t)txbuf;
    spi_xfer.rx_buf = (uintptr_t)rxbuf;
    spi_xfer.len = len;
    spi_xfer.delay_usecs = 0;
    spi_xfer.speed_hz = 0;
    spi_xfer.bits_per_word = 0;
    spi_xfer.cs_change = 0;

    /* Transfer */
    if (ioctl(spi->fd, SPI_IOC_MESSAGE(1), &spi_xfer) < 1)
        return _spi_error(spi, SPI_ERROR_TRANSFER, errno, "SPI transfer");

    return 0;
}


I also set the following in my main.c and then call the SPI function above to start the transfer:

int spidev_mode = 0;
int spi_max_speed = 220000000;

char spidevice_path[] = "/dev/spidev0.0";

spi_t *spi;

spi_open(spi,spidevice_path,spidev_mode,spi_max_speed);

Jephinus avatar Sep 08 '21 11:09 Jephinus

Here I attach two waveforms to better illustrate the issue I'm facing:

Figure 1: Waveform of the 4 SPI pins, showing a large CSB-SCK delay and SCK-CSB delay that slow down my SPI transfer considerably: SPI_Driver_Waveform1

Figure 2: The part marked by the red arrow is where the actual SPI transfer takes place; the rest is delay that I hope there is a way to remove. SPI_Driver_Waveform2

Jephinus avatar Sep 08 '21 12:09 Jephinus

Is that a delay of 3 usec or 3 msec between CS assert and first clock tick? If the former, I'm afraid that may be as good as it gets with a RasPi on Linux. It might be possible to be a tiny bit faster if dispensing with the OS and running on bare metal instead.

Is the 220 MHz clock speed actually reached? Various sources claim that 125 MHz is the maximum (e.g. see here).

l1k avatar Sep 08 '21 12:09 l1k

From the waveform, there is about a 3 us delay between CS assert and the first edge of SCK. May I know the reason for this delay? And can the delay between the last SCK edge and CS de-assert not be improved either?

Sorry, that was a typo; the SPI clock speed is set to 22 MHz.

Jephinus avatar Sep 08 '21 12:09 Jephinus

The annotation says 2.9us, which is pretty short. With the 250MHz core clock the fastest SPI clock is 125MHz, but in practice you'd struggle to get much more than 50MHz from a GPIO pin.

pelwell avatar Sep 08 '21 12:09 pelwell

The bcm2835 SPI driver uses software CS control - the CS line is just any random GPIO - which contributes to the delay.

pelwell avatar Sep 08 '21 12:09 pelwell

I'm working on a performance-critical application that requires high-speed SPI transfers. My implementation involves 1000 SPI reads per calculation, so 2.9us * 1000 is a relatively large delay for it.

Are there any modifications I can make to the SPI driver source code to shorten the delay?

Jephinus avatar Sep 08 '21 12:09 Jephinus

Since I'm using spidev0.0, I believe CS is fixed to GPIO8 (CE0) on the Raspberry Pi 4 GPIO header. If the GPIO pin can switch at 50MHz, why is such a large CSB delay introduced when I'm only running SCK at about 22MHz?

Jephinus avatar Sep 08 '21 13:09 Jephinus

Do you need to toggle CS at all?

pelwell avatar Sep 08 '21 13:09 pelwell

Yes, for my application I need to toggle CS after every 7-byte transfer.

Jephinus avatar Sep 08 '21 13:09 Jephinus

As an experiment, put force_turbo=1 in config.txt and reboot.

pelwell avatar Sep 08 '21 13:09 pelwell

I've tested it; the delay is almost the same, about 3us.

Jephinus avatar Sep 08 '21 15:09 Jephinus

I don't think you are going to improve without rewriting the driver.

I note that you've not shown two back-to-back transfers - for how long is CS high?

pelwell avatar Sep 08 '21 15:09 pelwell

You've brought up a good point. From the waveform, there is also a big delay between consecutive back-to-back transfers, during which CSB stays high. Although I'm doing continuous back-to-back transfers without introducing any delay in between, the CSB high time alternates: one short period, then one long period. I wonder what is causing this? Here I've attached some waveforms for clearer observation.

Figure 1: Zoomed-out view showing the alternating shorter and longer CSB high periods for continuous back-to-back transfers image0

Figure 2: The shorter CSB high period lasts about 10.7us image6

Figure 3: The longer CSB high period lasts about 19.30us image3

All these delays add up to a huge delay in my implementation. Any guidance or advice on reducing them would be much appreciated. Thanks a lot!
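
To put the numbers above together, here is a rough back-of-the-envelope sketch in C. It assumes 7-byte transfers at 22MHz, the 2.9us CS-to-SCK delay and the two measured CSB high periods. The function names are made up for illustration, and the SCK-to-CS trailing delay is left out because it was not quantified.

```c
#include <stdint.h>

/* Time spent actually clocking data on SCK, in microseconds. */
double clock_time_us(uint32_t bytes, uint32_t sck_hz)
{
    return (double)bytes * 8.0 * 1e6 / (double)sck_hz;
}

/* Approximate period of one transfer in microseconds:
 * CS-to-SCK lead delay + clocking time + CS-high gap before the next transfer.
 * The SCK-to-CS trailing delay is NOT included (assumption: unknown here). */
double transfer_period_us(double lead_us, double clock_us, double cs_high_us)
{
    return lead_us + clock_us + cs_high_us;
}

/* Fraction of the period during which data is actually being clocked. */
double bus_utilisation(double clock_us, double period_us)
{
    return clock_us / period_us;
}
```

With clock_time_us(7, 22000000) the clocking itself comes out at roughly 2.5us. Taking the average of the two CSB high periods (15.0us), one transfer period is roughly 20us, so 1000 transfers take on the order of 20ms, of which only about 2.5ms is spent clocking data.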

Jephinus avatar Sep 09 '21 03:09 Jephinus

I'm sure a kernel driver could reduce at least the inter-transfer delay, but I don't have a feel for how much. Before investing time in that direction it might be educational to check the timing for an SPI device with an existing driver, e.g. the ENC28J60 Ethernet controller.

pelwell avatar Sep 09 '21 08:09 pelwell

@Jephinus: When your program invokes the SPI_IOC_MESSAGE ioctl, it eventually ends up in spi_transfer_one_message(). The function calls spi_set_cs() to assert Chip Select, then iterates over the transfers in the message and invokes the ->transfer_one() callback for each of them. Afterwards it deasserts Chip Select.

If the RasPi SPI controller is used (not the spi1/spi2 mini-SPI controllers, but spi0 or any of the other ones on the RasPi4), the ->transfer_one() callback is implemented by bcm2835_spi_transfer_one(). That function performs a bunch of calculations and register writes and, since your transfers contain only 7 bytes, should then branch to bcm2835_spi_transfer_one_poll().

The delay between CS assert and first clock cycle is likely caused by the various calculations and register accesses in spi_transfer_one_message() and bcm2835_spi_transfer_one(). There is some potential for optimization here. E.g., the spi-bcm2835.c driver could cache the last-used clock speed. If another consecutive transfer is performed with the same clock speed, the driver could skip the register write to the CLK register. Whether that actually yields a performance gain would have to be measured.
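
The caching idea can be sketched roughly as below. This is a user-space illustration only, not the actual spi-bcm2835.c code: struct clk_cache, write_clk_divider() and set_clock_cached() are hypothetical names, the register write is simulated by a counter, and the divider calculation (round up, then force an even value) is a simplifying assumption.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-controller driver state. */
struct clk_cache {
    uint32_t last_hz;     /* speed programmed by the previous transfer, 0 = none yet */
    uint32_t last_cdiv;   /* divider value most recently "written" */
    unsigned reg_writes;  /* number of simulated CLK register writes */
};

/* Simulated register write; a real driver would write to the CLK register here. */
static void write_clk_divider(struct clk_cache *c, uint32_t cdiv)
{
    c->last_cdiv = cdiv;
    c->reg_writes++;
}

/* Program the clock divider only if the requested speed differs from the
 * cached one. Returns true if a (simulated) register write took place. */
bool set_clock_cached(struct clk_cache *c, uint32_t core_hz, uint32_t speed_hz)
{
    if (speed_hz == 0 || speed_hz == c->last_hz)
        return false;                 /* invalid or unchanged: skip division and write */

    /* Round up so the resulting SCK never exceeds the requested speed. */
    uint32_t cdiv = (uint32_t)(((uint64_t)core_hz + speed_hz - 1) / speed_hz);
    cdiv += cdiv % 2;                 /* assumption: divider must be even */

    write_clk_divider(c, cdiv);
    c->last_hz = speed_hz;
    return true;
}
```

Calling set_clock_cached() twice in a row with the same speed performs the division and the write only once. Whether skipping that work is visible on the oscilloscope is a separate question that only a measurement can answer.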

If you perform two consecutive printk() in the kernel, you'll notice that their time stamps are typically a couple dozen usec apart. Thus, a 3 usec delay for some calculations and register accesses is not implausible.

Note that it is possible to pass multiple transfers in a single message. E.g. you could amend your code to invoke a SPI_IOC_MESSAGE(10) ioctl and pass a pointer to an array of 10 transfers. By setting the cs_change bit in each transfer, you can force a chip-select toggle in-between two transfers. This may speed up back-to-back transfers.

In principle the SPI controller is capable of controlling Chip Select natively, i.e. it can automatically assert CS upon starting a transfer and deassert it at the end. But it was discovered that native Chip Select is a little glitchy (see commit a30a555d7435). Using GPIO Chip Select instead allows for optimizations such as pre-filling the FIFO before enabling interrupts (if bcm2835_spi_transfer_one_irq() is used instead of the _poll() variant -- only happens on larger transfers). For your special case (very small transfers, low latency desired, SPI mode 0) it may be possible to achieve lower latencies with native Chip Select. When we switched everything to GPIO Chip Select, we thought it wouldn't have any disadvantages. We didn't think of edge cases like this one I'm afraid.

I can't really offer an explanation why the latency between multiple back-to-back messages is jittery, but it may help to use a realtime kernel, i.e. with CONFIG_PREEMPT_RT_FULL=y. This requires that the RT patches have been applied to the kernel source tree.

l1k avatar Sep 09 '21 11:09 l1k

Thanks for providing detailed explanations. From my understanding, it seems impossible to further reduce the delay between CS assert and the first clock cycle? Does it have anything to do with CS setup time and hold time?

Can you explain more about how to pass multiple transfers in a single message by setting SPI_IOC_MESSAGE[14]? Let's say I have an array spi_data_to_transfer[14] = {0x00,0x01,0x02,0x03,0x04,0x05,0x06,0xA0,0xA1,0xA2,0xA3,0xA4,0xA5,0xA6};

How do I modify the struct spi_ioc_transfer setup to send this array and have CS toggle after every 7 bytes sent?

Sorry, I'm not familiar with applying RT patches to my kernel source. Let's say my kernel source is rpi-5.10.y; may I know the steps to apply 5.10-rt to it, as I can only find rpi-4.19.y-rt?

Jephinus avatar Sep 11 '21 13:09 Jephinus

I've hacked together a patch which caches the last-used clock speed and avoids setting it upon every SPI transfer if the current transfer's clock speed is the same as the last one's. This saves one register write and a couple of calculations (including some divisions, which may be expensive). The patch is the top-most commit on this branch. Could you try compiling a kernel based on that branch and re-test with your oscilloscope whether the 3 usec delay has shrunk? Instructions for (cross-)compiling and installing the kernel can be found here.

If the cs_setup / cs_hold times are zero (which they should be), they shouldn't have any effect. I think the delays you're seeing are caused by the register writes and calculations and the above-mentioned patch gets rid of some of them.

To send 14 bytes, instantiate and populate a struct spi_ioc_transfer mesg[2] array, set the cs_change bit in the first element of the array mesg[0] and transmit it with ioctl(fd, SPI_IOC_MESSAGE(2), mesg).
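
A minimal sketch of this, generalized to any whole number of 7-byte chunks. The helper names build_chunked_transfers() and spi_transfer_chunked(), and the CHUNK_LEN and MAX_CHUNKS constants, are made up for illustration. The important detail is that each array element's tx_buf and rx_buf must point at its own slice of the data; if every element points at the start of the buffer, the same first 7 bytes go out each time.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>

#define CHUNK_LEN  7   /* bytes per chip-select cycle (from this thread) */
#define MAX_CHUNKS 16  /* arbitrary upper bound chosen for this sketch */

/* Fill xfers[] with one transfer per CHUNK_LEN-byte slice of tx/rx.
 * Returns the number of transfers, or -1 if the length does not fit. */
int build_chunked_transfers(struct spi_ioc_transfer *xfers, size_t max_xfers,
                            const uint8_t *tx, uint8_t *rx, size_t total_len)
{
    size_t n = total_len / CHUNK_LEN;

    if (total_len == 0 || total_len % CHUNK_LEN != 0 || n > max_xfers)
        return -1;

    /* Zero everything so delay_usecs, word_delay_usecs etc. are all 0. */
    memset(xfers, 0, n * sizeof(*xfers));

    for (size_t i = 0; i < n; i++) {
        xfers[i].tx_buf = (uintptr_t)(tx + i * CHUNK_LEN);  /* slice i, not the start */
        xfers[i].rx_buf = (uintptr_t)(rx + i * CHUNK_LEN);
        xfers[i].len = CHUNK_LEN;
        /* Toggle CS after every chunk except the last; CS is released
         * at the end of the message anyway. */
        xfers[i].cs_change = (i + 1 < n) ? 1 : 0;
    }
    return (int)n;
}

/* Send the whole buffer as a single SPI message. Returns 0 on success. */
int spi_transfer_chunked(int fd, const uint8_t *tx, uint8_t *rx, size_t total_len)
{
    struct spi_ioc_transfer xfers[MAX_CHUNKS];
    int n = build_chunked_transfers(xfers, MAX_CHUNKS, tx, rx, total_len);

    if (n < 0)
        return -1;
    return ioctl(fd, SPI_IOC_MESSAGE(n), xfers) < 0 ? -1 : 0;
}
```

For the 14-byte array, spi_transfer_chunked(fd, spi_data_to_transfer, rxbuf, 14) would build two transfers, with cs_change set only on the first, and submit them in one ioctl.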

There is no rpi-5.10.y-rt branch and the rpi-4.19.y-rt branch seems unmaintained. You'd have to take the rpi-5.10.y branch as a basis and cherry-pick the RT patches from the v5.10-rt-rebase branch, solve merge conflicts (if any) and cherry-pick the "Fix USB/FIQ lock-ups" patch from rpi-4.19.y-rt. So this is non-trivial I'm afraid. The RT patches make lots of stuff preemptible, including many IRQ handlers, and this reduces latencies a fair bit. However, the clock cache patch I mentioned above may already help to get rid of the delays you're seeing.

l1k avatar Sep 14 '21 08:09 l1k

I've applied your latest patch to my kernel and, from my observation, the 3 usec delay is still the same, with no sign of shrinking. The big delay while CSB is high is also unchanged, so the overall SPI transfer time is the same.

Are there any examples I can follow for struct spi_ioc_transfer mesg[2]? I believe I've set it up as you described, but the issue I'm facing is that the ioctl sends only the first 7 bytes. It does send in two rounds, but both rounds send the same FIRST 7 bytes, whereas I want the first 7 bytes sent, then the next 7 bytes.

I'm using rpi-5.10.y, which is why I'm not sure of the steps to apply the real-time patches to my kernel. Any guidelines I can follow would be appreciated. Thanks.

Jephinus avatar Sep 15 '21 14:09 Jephinus

Hi, can anybody help advise on the issue I'm facing? Thanks a lot.

Jephinus avatar Sep 22 '21 01:09 Jephinus

I doubt you'll be able to improve the situation very much - the kernel SPI subsystem and drivers don't intentionally waste time. Even if you could reduce the best-case interval you would still be at the mercy of the scheduler not to be busy at a critical time.

I have had a suggestion of using the Pico (RP2040) as an SPI<->USB adapter, but that adds hardware and software complexity - it would need a small image on Pico and a bespoke Linux driver for the host side - probably not a route you would want to go down.

pelwell avatar Sep 22 '21 09:09 pelwell

Hi everyone, I've run into the same problem here. I am using an ADS1263 chip, which is an SPI ADC. The time from CS to SCLK is large enough to slow down the whole transfer. It is not just 3us, but about 20us before SCLK starts and about 10us after SCLK ends. The SCLK activity itself takes only about 10us, so the whole transfer takes about 40us. I use a Raspberry Pi 3 Model B, and I am coding in C++ using the spidev interface. My code is as follows:

int8_t spi::transfer(uint8_t const* tx, uint8_t const* rx, uint32_t len)
{
	if (fd >= 0) {
		spi_ioc_transfer spi;
		memset(&spi, 0, sizeof(spi));

		spi.tx_buf = (unsigned long)tx;
		spi.rx_buf = (unsigned long)rx;
		spi.len = len;
		spi.delay_usecs = 0;
		spi.speed_hz = spi_speed;
		spi.bits_per_word = 8;

		int32_t ret = -1;
		ret = ioctl(fd, SPI_IOC_MESSAGE(1), &spi);
		if (ret < 1) {
			cout << "can't send spi message" << endl;
			return -2;
		}
		else {
			return 0;
		}
	}
	else return -1;
}

I will post my scope pictures soon. Thanks in advance.

daiyicun avatar Jan 26 '24 07:01 daiyicun

wav0 wav1 Yellow is SCLK, blue is the trigger signal, and red is CS. When the trigger's falling edge arrives, CS goes low quickly and this transfer function is called; after the function returns, CS is set high.

daiyicun avatar Jan 26 '24 07:01 daiyicun

For more information, I added a timer around this function, and it really does take over 50 us.
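
One way to take such a measurement is sketched below, assuming CLOCK_MONOTONIC is available. The names elapsed_ns() and time_call_ns() are made up for illustration; the function being timed is passed in as a callback so the sketch does not depend on any particular SPI code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Difference between two timestamps in nanoseconds (end - start). */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Time a single call of fn(arg) using the monotonic clock.
 * Returns the elapsed time in nanoseconds, or -1 if the clock cannot be read. */
int64_t time_call_ns(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    fn(arg);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    return elapsed_ns(&t0, &t1);
}
```

A single sample can be skewed by scheduling, so repeating the call many times and looking at the minimum and the spread usually gives a clearer picture than one reading.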

daiyicun avatar Jan 26 '24 07:01 daiyicun