bench and other examples don't work on STM32L053

Open microtronics opened this issue 4 years ago • 27 comments

Hi!

I did a lot of testing and it seems the code doesn't work on STM32L boards with Arduino.

I get everything up and running using an Arduino Nano, so both the SD card and the connections are fine. It also works when I drop the version to a V1 release (I haven't tested other V2 releases). I checked that HAS_SDIO stays 0, so it is expected to work via SPI the same way it does for the Nano.

I also dropped the SPI frequency down to 1 MHz, but that doesn't change anything either.

The bench example hangs just after reporting the available stack.

Any idea why your library doesn't work on STM32Ls anymore?

microtronics avatar Jun 15 '21 14:06 microtronics

I tested bench with no mods on a Nucleo L476RG with the current version of SdFat and Arduino Core for STM32 2.0.0.

Build output:

Sketch uses 36488 bytes (3%) of program storage space. Maximum is 1048576 bytes.
Global variables use 2988 bytes (3%) of dynamic memory, leaving 95316 bytes for local variables. Maximum is 98304 bytes.
1 File(s) copied
Upload complete on NODE_L476RG (H:)

bench output:

Use a freshly formatted SD for best performance.

Type any character to start
FreeStack: 95236
Type is FAT32
Card size: 32.01 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: 00000
Version: 1.0
Serial number: 0X1550FE3
Manufacturing date: 10/2015

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
401.93,1291,1271,1272
401.90,1289,1271,1272

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
398.88,1299,1281,1282
398.91,1300,1281,1282

Done

Don't have a clue what your problem is.

greiman avatar Jun 15 '21 16:06 greiman

Just tried it again on my laptop with a fresh install of Arduino and all libraries. The issue persists and the compile sizes differ:

"C:\\Users\\xxx\\AppData\\Local\\Arduino15\\packages\\STMicroelectronics\\tools\\xpack-arm-none-eabi-gcc\\9.3.1-1.3/bin/arm-none-eabi-size" -A "C:\\Users\\xxx\\AppData\\Local\\Temp\\arduino_build_515610/bench.ino.elf"
Sketch uses 39492 bytes (3%) of program storage space. Maximum is 1048576 bytes.
Global variables use 2964 bytes (3%) of dynamic memory, leaving 95340 bytes for local variables. Maximum is 98304 bytes.

I'm using Windows 8.1 and Windows 10. Are you using Linux maybe?

microtronics avatar Jun 15 '21 18:06 microtronics

I have now done another test in an Ubuntu VM with a fresh install of everything, and the issue remains the same. Installed the latest Arduino, SdFat 2.0.6 and the STM32 2.0.0 core:

Sketch uses 39396 bytes (3%) of program storage space. Maximum is 1048576 bytes.
Global variables use 2964 bytes (3%) of dynamic memory, leaving 95340 bytes for local variables. Maximum is 98304 bytes.

The size of the compiled code still differs from yours, so I'm pretty sure there is still an issue somewhere...

microtronics avatar Jun 15 '21 19:06 microtronics

I am using Windows 10 on a new PC I built 6 weeks ago with an ASUS Prime Z590-A motherboard. New install of everything at the end of April.

There is some major difference in your system. I am using Arduino IDE 1.8.15. Here is the size of "hello, world"

void setup() {
  Serial.begin(9600);
  Serial.print("hello, world\n");
}
void loop() {
}
Sketch uses 12320 bytes (1%) of program storage space. Maximum is 1048576 bytes.
Global variables use 900 bytes (0%) of dynamic memory, leaving 97404 bytes for local variables. Maximum is 98304 bytes.
1 File(s) copied
Upload complete on NODE_L476RG (H:)

greiman avatar Jun 15 '21 19:06 greiman

Sketch uses 14772 bytes (1%) of program storage space. Maximum is 1048576 bytes.
Global variables use 868 bytes (0%) of dynamic memory, leaving 97436 bytes for local variables. Maximum is 98304 bytes.

However, I'm getting the exact same results for all code examples on both of my computers, so I suspect that something is different on your machine...

[Image: boardSettings]

microtronics avatar Jun 15 '21 20:06 microtronics

Maybe turning on compile output on your machine reveals something...

microtronics avatar Jun 15 '21 20:06 microtronics

It seems like it is dying upon evaluating:

inline uint32_t getLe32(const uint8_t* src) {
  return *reinterpret_cast<const uint32_t*>(src);
}

This is hit from FatPartition.cpp:422, volumeStartSector = getLe32(mp->relativeSectors); after that it ends up in an infinite loop in a fault handler.

microtronics avatar Jun 15 '21 23:06 microtronics

Changing USE_SIMPLE_LITTLE_ENDIAN to 0 fixes the issue... The only question is: why?

microtronics avatar Jun 16 '21 00:06 microtronics

That means STM32L4 needs to use the slower byte unpacking functions. Try editing this part of SdFatConfig.h

/**
 * Set USE_SIMPLE_LITTLE_ENDIAN nonzero for little endian processors
 * with no memory alignment restrictions.
 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ && !defined(__SAMD21G18A__)\
  && !defined(__MKL26Z64__) && !defined(ESP8266)
#define USE_SIMPLE_LITTLE_ENDIAN 1
#else  // __BYTE_ORDER_
#define USE_SIMPLE_LITTLE_ENDIAN 0
#endif  // __BYTE_ORDER_

Strange, I have gone to an old PC and get about the same sizes as you but bench still runs.

You can speed up by a factor of three by this edit to SdFatConfig.h.

/**
 * If USE_SPI_ARRAY_TRANSFER is non-zero and the standard SPI library is
 * use, the array transfer function, transfer(buf, size), will be used.
 * This option will allocate up to a 512 byte temporary buffer for send.
 * This may be faster for some boards.  Do not use this with AVR boards.
 */
#define USE_SPI_ARRAY_TRANSFER 1

greiman avatar Jun 16 '21 00:06 greiman

I will always use the byte access unpack functions in the future.

FAT structures are packed so Cortex M0 will have problems. M3, M4, and M7 should be OK but it's not worth the risk.

I suspect a compiler change. There are about four Cortex M3, M4, M7 instructions that do fault on unaligned access.

One of the big selling points of ARM Cortex-M is that it doesn’t care about alignment. It all “just works”. Well, except for this footnote: "Unaligned LDM, STM, LDRD, and STRD instructions always fault irrespective of the setting of UNALIGN_TRP"
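To make the failure mode concrete, here is a minimal sketch of an alignment-safe little-endian load. The function name getLe32Bytes is made up for this illustration and this is not necessarily the code SdFat ships; it only shows the shift/or technique, which reads one byte at a time so the compiler never needs a 32-bit load from an odd address.

```cpp
#include <cstdint>

// Hypothetical helper (name chosen for this example only).
// Reads a 32-bit little-endian value one byte at a time, so it works
// for any address, including fields inside packed FAT structures.
inline uint32_t getLe32Bytes(const uint8_t* src) {
  return static_cast<uint32_t>(src[0]) |
         (static_cast<uint32_t>(src[1]) << 8) |
         (static_cast<uint32_t>(src[2]) << 16) |
         (static_cast<uint32_t>(src[3]) << 24);
}
```

For a packed record such as {0xAA, 0x78, 0x56, 0x34, 0x12}, calling getLe32Bytes(record + 1) reads the field at the odd offset 1 and yields 0x12345678 without any wide load.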

greiman avatar Jun 16 '21 00:06 greiman

Yeah, I agree that it seems quite hard to guarantee alignment on the FAT data structures.

One suggestion is to use memcpy: the compiler appears clever enough to optimize these accesses on compatible targets automatically, so it should be a win-win, because you won't have to think about alignment no matter what architecture is used.

microtronics avatar Jun 16 '21 07:06 microtronics

One more thing: I used a 2GB card -> FAT16. Did you use a FAT32 card? Perhaps there are fewer alignment issues there...

microtronics avatar Jun 16 '21 08:06 microtronics

Actually FAT16 and FAT32 have the same directory structure and FAT32 has the FAT16 volume structure plus extra 32-bit fields. You must read all FAT16 fields before the extra FAT32 fields.

This is how Microsoft maintained compatibility: zeros in some 16-bit fields tell you to read the 32-bit FAT32 fields.

memcpy is not useful since the full byte functions handle little-endian to processor endian as well as processor alignment.
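As an illustration of that rule, here is a rough sketch that reads two boot-sector fields this way. The byte offsets (19 and 32 for the total sector count, 22 and 36 for the FAT size) follow the published FAT layout; the function names are invented for this example and are not SdFat's API.

```cpp
#include <cstdint>

// Byte-wise little-endian loads, safe for unaligned packed fields.
inline uint16_t le16(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
inline uint32_t le32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// bs points at the 512-byte boot sector of the volume.
// A zero in the 16-bit total-sector field (offset 19) means the
// 32-bit field at offset 32 holds the real value.
inline uint32_t fatTotalSectors(const uint8_t* bs) {
  const uint16_t total16 = le16(bs + 19);
  return total16 != 0 ? total16 : le32(bs + 32);
}

// A zero in the 16-bit FAT-size field (offset 22) means the volume is
// FAT32 and the size is in the 32-bit field at offset 36.
inline uint32_t fatSectorsPerFat(const uint8_t* bs) {
  const uint16_t size16 = le16(bs + 22);
  return size16 != 0 ? size16 : le32(bs + 36);
}
```

Note that offset 19 is odd, so even the 16-bit read would be unaligned if done with a pointer cast.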

The key place where I will use the cast method is AVR. The cast saves flash and is faster on Uno.

greiman avatar Jun 16 '21 11:06 greiman

memcpy is not useful since the full byte functions handle little-endian to processor endian as well as processor alignment.

Actually this is done by a built-in function, but since you already wrote the methods for all relevant types and the compiler should optimize these too, it should be fine that way.

Just FYI: this variant requires -O2 or -Os to get the optimal result: https://godbolt.org/z/756odPEn4

For these two variants -O1 is sufficient to get good code: https://godbolt.org/z/dT43hGjPc https://godbolt.org/z/GK767MT83

All three examples compile to the exact same assembly on AVR-GCC.

microtronics avatar Jun 16 '21 12:06 microtronics

On little-endian memcpy is the same as shift and or. Try your compile examples on big-endian. That's why I have this: #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ in the cast test.

memcpy does not convert endian. All FAT implementations I know of use the shift and or method. It is not worth worrying about optimization, the overhead in speed is small in converting the volume entries. For SdFat, the implementation of the SPI driver is key.
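The difference between the two approaches can be sketched as follows. Both functions are alignment-safe, but only the shift/or version is independent of the host byte order. All names here are illustrative and not taken from SdFat.

```cpp
#include <cstdint>
#include <cstring>

// memcpy load: alignment-safe, but the result is in host byte order.
inline uint32_t loadNative32(const uint8_t* src) {
  uint32_t v;
  std::memcpy(&v, src, sizeof(v));
  return v;
}

// Shift/or load: alignment-safe and always interprets the bytes as
// little-endian, whatever the host byte order is.
inline uint32_t loadLe32(const uint8_t* src) {
  return static_cast<uint32_t>(src[0]) |
         (static_cast<uint32_t>(src[1]) << 8) |
         (static_cast<uint32_t>(src[2]) << 16) |
         (static_cast<uint32_t>(src[3]) << 24);
}

// Run-time probe of the host byte order, used only for the comparison.
inline bool hostIsLittleEndian() {
  const uint16_t probe = 1;
  uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}
```

For the bytes {0x78, 0x56, 0x34, 0x12}, loadLe32 always returns 0x12345678, while loadNative32 returns the same value only when hostIsLittleEndian() is true; a big-endian host would need an extra byte swap after the memcpy.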

I implemented a DMA driver for Due and Teensy. They are much faster than STM32 SPI. The Arduino STM32 SPI driver needs to be reworked to use DMA for transfer(buf, count).

I was able to increase the speed of the STM32L476 from 400 KB/sec to over 1100 KB/sec with the edit of USE_SPI_ARRAY_TRANSFER. Teensy gets about 5000 KB/sec with DMA SPI.

My SDIO driver on Teensy runs at 22 MB/sec but the STM32 SDIO driver is very slow because STM32 controllers before H7 are dogs.

I was able to get your sizes. I deleted all STM32 packages and reinstalled STM32 V2.0. Both 2GB and 32GB cards work on my STM32L476 with the casts.

Sketch uses 39492 bytes (3%) of program storage space. Maximum is 1048576 bytes.
Global variables use 2964 bytes (3%) of dynamic memory, leaving 95340 bytes for local variables. Maximum is 98304 bytes.
1 File(s) copied
Upload complete on NODE_L476RG (H:)

greiman avatar Jun 16 '21 12:06 greiman

On the L053 I'm getting 300 KB/s with array transfer. It only runs at 32 MHz, so these values are similar enough. Enabling DMA would of course be awesome, but since the L-series are not part of the other Arduino cores (maple or rogerclark) and the official core currently has no DMA support, it would be a good portion of work to get this up and running, I'm afraid....

microtronics avatar Jun 16 '21 13:06 microtronics

DMA would not be too hard since SdFat supports custom drivers and STM32 Arduino has the L0 drivers here:

C:\Users\Bill\AppData\Local\Arduino15\packages\STMicroelectronics\hardware\stm32\2.0.0\system\Drivers\STM32L0xx_HAL_Driver

They just didn't do the Arduino wrapper for

HAL_StatusTypeDef HAL_SPI_Transmit_DMA(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size);
HAL_StatusTypeDef HAL_SPI_Receive_DMA(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size);

greiman avatar Jun 16 '21 13:06 greiman

I've done DMA using Cube on another controller already, and AFAIR you also have to initialize the DMA itself and the RCC module to clock and power it correctly.

microtronics avatar Jun 16 '21 13:06 microtronics

I found that Uno uses more flash with shift/or and many Uno users are at the limit, so I plan to use this condition for pack/unpack. Could you test it on STM32L053? __ARM_FEATURE_UNALIGNED means the processor supports unaligned access.

/**
 * Set USE_SIMPLE_LITTLE_ENDIAN nonzero for little endian processors
 * with no memory alignment restrictions.
 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__\
  && (defined(__AVR__) || defined(__ARM_FEATURE_UNALIGNED))
#define USE_SIMPLE_LITTLE_ENDIAN 1
#else  // __BYTE_ORDER_
#define USE_SIMPLE_LITTLE_ENDIAN 0
#endif  // __BYTE_ORDER_

Here is a test program that checks the Cortex M features.

void setup() {
  Serial.begin(9600);
  while (!Serial) {}
#if defined(__CORTEX_M)
  Serial.print("Cortex M");
  Serial.println(__CORTEX_M);
#else
  Serial.println("not Cortex M");
#endif

#ifdef __ARM_FEATURE_UNALIGNED
  Serial.println("supports unaligned");
#else
  Serial.println("no unaligned support");
#endif
}

void loop() {
}

I looked at the Cube code for SPI on STM32L053 and the cpu reference. I agree that DMA wouldn't be worth the effort. STM32L0 is great for low power but that means limited performance for SPI.

greiman avatar Jun 17 '21 14:06 greiman

The output of the sketch is as expected:

Cortex M0
no unaligned support

and the fix for USE_SIMPLE_LITTLE_ENDIAN appears to work, too. I get the same speeds as before on both the Nano and the L053.

For the L053 I'm a bit disappointed about the speed. For the current use the lib will be used on a L476 to load images for GUISLICE from the SD, but I guess even on that it will only be about as fast as on the nano.

However, using the "normal" HAL_SPI_Transmit and HAL_SPI_Receive is currently even slower than the default driver. And as I expected, HAL_SPI_Transmit_DMA/HAL_SPI_Receive_DMA doesn't work out of the box. I'll spend a few more hours on finding what initialization is missing to get it running, but not for too long, as the slower speed should already be sufficient for the task.

In theory the limit for the L053 should be at 2MB/s for 16MHz SPI speed... Though even ST states that you can't get the full speed without DMA, because feeding data takes more cycles than transmitting at that speed.
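The 2 MB/s figure follows from one bit per SCK cycle and eight bits per byte. A small sketch of that arithmetic, with helper names made up for this example and ignoring command and CRC overhead as well as any gaps between bytes:

```cpp
#include <cstdint>

// Upper bound on SPI payload throughput in bytes per second:
// one bit is transferred per SCK cycle, eight bits per byte.
constexpr uint32_t spiMaxBytesPerSec(uint32_t sckHz) {
  return sckHz / 8u;
}

// Minimum time in microseconds needed just to clock out one
// 512-byte sector at the given SCK frequency.
constexpr uint32_t sectorClockTimeUsec(uint32_t sckHz) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(512u) * 8u * 1000000u) / sckHz);
}
```

At 16 MHz this gives 2,000,000 bytes per second and 256 microseconds per 512-byte sector, which is the floor any driver on this bus could reach per sector.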

microtronics avatar Jun 17 '21 14:06 microtronics

Thanks for testing.

All M0 Arduino boards I have put on a logic analyzer run slower than a Uno at 8 MHz.

Uno at 8MHz:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
683.06,13084,728,743
689.66,4396,728,736

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
670.96,912,752,757
670.96,916,752,756

The M0 drivers appear to have a tight loop, but the logic analyzer shows bytes clocked out at 16 MHz with huge gaps between bytes.

Adafruit has a SAMD21 DMA driver but at max SCK frequency many SD cards fail. Even the Arduino SAMD driver fails for some cards. There are lots of reports of SPI errors at high speed for other devices on SAMD.

Too bad, there is a real need for low power battery powered devices. I spent time helping SdFat support a "cave logger" that could run for almost a year on two AA batteries.

Our performance benchmark is at least one full year of operation on standard AA batteries.

greiman avatar Jun 17 '21 15:06 greiman

I finally got DMA to work. Currently only used for reading:

FILE_SIZE_MB = 10
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
103.66,56454,4889,4933
103.82,49518,4889,4925

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1349.71,376,371,372
1349.71,376,371,372

This is about what I expect for the L053!

microtronics avatar Jun 17 '21 19:06 microtronics

Great improvement. I think I see the problem with the STM32 SPI wrapper.

The SPI controller has a buffer register and shift register for both RX and TX. The loop used by transfer(buf, count) appears to use the buffer registers but does not really take advantage of them.

The loop waits for TXE, then writes the TX buffer, next waits for RXNE, then reads the RX buffer. This means the buffers are empty most of the time.

If the SPI controller works properly you can pre-write the TX buffer before the loop and read the last byte after the loop. It depends on RX holding a byte in the shift register until the RX buffer is empty, otherwise an interrupt can cause a receive overrun.

It looks like the STM32L053 just overwrites the RX buffer as soon as the shift register is full, so you can't really use the buffers in full duplex mode for transfer(buf, count).

DMA is the best way to make use of the buffers in the SPI controller.

greiman avatar Jun 18 '21 14:06 greiman

I rearranged the loop in the ST wrapper and made slightly better use of the SPI controller and got a substantial improvement in bench without danger of receive overrun.

Here is bench on STM32L476

ST way:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1165.23,455,436,437
1164.96,455,436,437

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1158.21,445,439,440
1158.21,444,439,440

My loop:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1724.14,312,294,295
1724.14,312,294,295

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1708.82,303,297,298
1709.40,302,297,298

Here are the changes in C:\Users\Bill\AppData\Local\Arduino15\packages\STMicroelectronics\hardware\stm32\2.0.0\libraries\SPI\src\utility\spi_com.c

The symbol OLD_WAY picks the loop. I didn't fix it for skipReceive or timeout.

#ifdef OLD_WAY
  while (size--) {
#if defined(STM32H7xx) || defined(STM32MP1xx)
    while (!LL_SPI_IsActiveFlag_TXP(_SPI));
#else
    while (!LL_SPI_IsActiveFlag_TXE(_SPI));
#endif
    LL_SPI_TransmitData8(_SPI, *tx_buffer++);

    if (!skipReceive) {
#if defined(STM32H7xx) || defined(STM32MP1xx)
      while (!LL_SPI_IsActiveFlag_RXP(_SPI));
#else
      while (!LL_SPI_IsActiveFlag_RXNE(_SPI));
#endif
      *rx_buffer++ = LL_SPI_ReceiveData8(_SPI);
    }
    if ((Timeout != HAL_MAX_DELAY) && (HAL_GetTick() - tickstart >= Timeout)) {
      ret = SPI_TIMEOUT;
      break;
    }
  }
#else  // OLD_WAY
  while (!LL_SPI_IsActiveFlag_TXE(_SPI));
  LL_SPI_TransmitData8(_SPI, *tx_buffer++);
  size--;
  while (size--) {
    while (!LL_SPI_IsActiveFlag_RXNE(_SPI));
    *rx_buffer++ = LL_SPI_ReceiveData8(_SPI);
    LL_SPI_TransmitData8(_SPI, *tx_buffer++);
    if ((Timeout != HAL_MAX_DELAY) && (HAL_GetTick() - tickstart >= Timeout)) {
      ret = SPI_TIMEOUT;
      break;
    }
  }
  while (!LL_SPI_IsActiveFlag_RXNE(_SPI));
  *rx_buffer++ = LL_SPI_ReceiveData8(_SPI);
#endif  // OLD_WAY

greiman avatar Jun 18 '21 15:06 greiman

Here is the code I used for DMA on the L4 and it works on the L0 with minor modifications (DMA IRQhandler and channel), too:

DMA_HandleTypeDef hdma_spi2_rx, hdma_spi2_tx;

class MySpiClass : public SdSpiBaseClass {
public:
    void activate()
    {
        uint32_t spi_freq = HAL_RCC_GetPCLK1Freq();
        if (spiClockSpeed >= (spi_freq / 2)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_2;
        } else if (spiClockSpeed >= (spi_freq / 4)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_4;
        } else if (spiClockSpeed >= (spi_freq / 8)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_8;
        } else if (spiClockSpeed >= (spi_freq / 16)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_16;
        } else if (spiClockSpeed >= (spi_freq / 32)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_32;
        } else if (spiClockSpeed >= (spi_freq / 64)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_64;
        } else if (spiClockSpeed >= (spi_freq / 128)) {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_128;
        } else {
            hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_256;
        }
        HAL_SPI_Init(&hspi2);       //changing the BR register directly would be easier and less costly
    }
    
    void begin(SdSpiConfig config)
    {
        hspi2.Instance = SPI2;
        hspi2.Init.Mode = SPI_MODE_MASTER;
        hspi2.Init.Direction = SPI_DIRECTION_2LINES;
        hspi2.Init.DataSize = SPI_DATASIZE_8BIT;
        hspi2.Init.CLKPolarity = SPI_POLARITY_LOW;
        hspi2.Init.CLKPhase = SPI_PHASE_1EDGE;
        hspi2.Init.NSS = SPI_NSS_SOFT;
        hspi2.Init.BaudRatePrescaler = SPI_BAUDRATEPRESCALER_128;
        hspi2.Init.FirstBit = SPI_FIRSTBIT_MSB;
        hspi2.Init.TIMode = SPI_TIMODE_DISABLE;
        hspi2.Init.CRCCalculation = SPI_CRCCALCULATION_DISABLE;
        hspi2.Init.CRCPolynomial = 7;
        hspi2.Init.NSSPMode = SPI_NSS_PULSE_DISABLE;
        if (HAL_SPI_Init(&hspi2) != HAL_OK) {
            Serial.println("Error SPI2");
        }

        GPIO_InitTypeDef GPIO_InitStruct = { 0 };

        __HAL_RCC_SPI2_CLK_ENABLE();
        __HAL_RCC_GPIOB_CLK_ENABLE();

        GPIO_InitStruct.Pin = GPIO_PIN_13 | GPIO_PIN_14 | GPIO_PIN_15;
        GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
        GPIO_InitStruct.Pull = GPIO_NOPULL;
        GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_VERY_HIGH;
        GPIO_InitStruct.Alternate = GPIO_AF5_SPI2;
        HAL_GPIO_Init(GPIOB, &GPIO_InitStruct);

        /* SPI2 DMA Init */
        __HAL_RCC_DMA1_CLK_ENABLE();
        HAL_NVIC_SetPriority(DMA1_Channel4_IRQn, 0, 0);
        HAL_NVIC_EnableIRQ(DMA1_Channel4_IRQn);
        HAL_NVIC_SetPriority(DMA1_Channel5_IRQn, 0, 0);
        HAL_NVIC_EnableIRQ(DMA1_Channel5_IRQn);
        /* SPI2_RX Init */
        hdma_spi2_rx.Instance = DMA1_Channel4;
        hdma_spi2_rx.Init.Request = DMA_REQUEST_1;
        hdma_spi2_rx.Init.Direction = DMA_PERIPH_TO_MEMORY;
        hdma_spi2_rx.Init.PeriphInc = DMA_PINC_DISABLE;
        hdma_spi2_rx.Init.MemInc = DMA_MINC_ENABLE;
        hdma_spi2_rx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
        hdma_spi2_rx.Init.MemDataAlignment = DMA_MDATAALIGN_BYTE;
        hdma_spi2_rx.Init.Mode = DMA_NORMAL;
        hdma_spi2_rx.Init.Priority = DMA_PRIORITY_HIGH;
        if (HAL_DMA_Init(&hdma_spi2_rx) != HAL_OK) {
            Serial.println("Error DMA RX");
        }

        __HAL_LINKDMA(&hspi2, hdmarx, hdma_spi2_rx);

        /* SPI2_TX Init */
        hdma_spi2_tx.Instance = DMA1_Channel5;
        hdma_spi2_tx.Init.Request = DMA_REQUEST_1;
        hdma_spi2_tx.Init.Direction = DMA_MEMORY_TO_PERIPH;
        hdma_spi2_tx.Init.PeriphInc = DMA_PINC_DISABLE;
        hdma_spi2_tx.Init.MemInc = DMA_MINC_ENABLE;
        hdma_spi2_tx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
        hdma_spi2_tx.Init.MemDataAlignment = DMA_MDATAALIGN_BYTE;
        hdma_spi2_tx.Init.Mode = DMA_NORMAL;
        hdma_spi2_tx.Init.Priority = DMA_PRIORITY_VERY_HIGH;
        if (HAL_DMA_Init(&hdma_spi2_tx) != HAL_OK) {
            Serial.println("Error DMA TX");
        }

        __HAL_LINKDMA(&hspi2, hdmatx, hdma_spi2_tx);
        
        (void)config;
    }
    // Deactivate SPI hardware.
    void deactivate()
    {
        // SPI.endTransaction();  //original endTransaction only removes config from cache
    }
    // Receive a byte.
    uint8_t receive()
    {
        uint8_t v = 0xFF;
        HAL_SPI_Receive(&hspi2, &v, 1, 1000);
        return v;
    }

    uint8_t receive(uint8_t* buf, size_t count)
    {
         //for (size_t i = 0; i < count; i++) {
         //    buf[i] = receive();
         //}

        // while (!__HAL_SPI_GET_FLAG(&hspi2, SPI_FLAG_TXE))
        //     ;
        // hspi2.Instance->DR = 0xFF;
        // count--;
        // while (count--) {
        //     while (!__HAL_SPI_GET_FLAG(&hspi2, SPI_FLAG_RXNE))
        //         ;
        //     *buf++ = hspi2.Instance->DR;
        //     hspi2.Instance->DR = 0xFF;
        // }
        // while (!__HAL_SPI_GET_FLAG(&hspi2, SPI_FLAG_RXNE))
        //     ;
        // *buf++ = hspi2.Instance->DR;


        __HAL_DMA_DISABLE(&hdma_spi2_rx);             //not sure if it is necessary to disable first
        __HAL_DMA_DISABLE(&hdma_spi2_tx);             //but some settings should not be changed while it is active
        hdma_spi2_rx.Instance->CCR |= DMA_CCR_MINC;
        hdma_spi2_tx.Instance->CCR &= ~DMA_CCR_MINC;

        uint8_t v = 0xFF;
        
        HAL_SPI_TransmitReceive_DMA(&hspi2, &v, buf, count);

        while (hspi2.hdmarx->State & HAL_DMA_STATE_BUSY)
            ;
            
        return 0;
    }

    void send(uint8_t data)
    {
        HAL_SPI_Transmit(&hspi2, &data, 1, 1000);
    }
    
    void send(const uint8_t* buf, size_t count)
    {
         //for (size_t i = 0; i < count; i++) {
         //    send(buf[i]);
         //}
         
        __HAL_DMA_DISABLE(&hdma_spi2_rx);
        __HAL_DMA_DISABLE(&hdma_spi2_tx);
        hdma_spi2_tx.Instance->CCR |= DMA_CCR_MINC;
        hdma_spi2_rx.Instance->CCR &= ~DMA_CCR_MINC;

        uint8_t rdVal;
        // HAL_SPI_Transmit_DMA(&hspi2, sbuf, count);       //This would overwrite the buffer with received values
        HAL_SPI_TransmitReceive_DMA(&hspi2, (uint8_t *)buf, &rdVal, count);

        while (hspi2.hdmatx->State & HAL_DMA_STATE_BUSY)    //This is already true when DMA triggered the last transmission
            ;
        while (hspi2.Instance->SR & SPI_SR_BSY)             //This waits till SPI is really finished. Alternatively, use hdmarx state
            ;
    }
    
    void setSckSpeed(uint32_t maxSck)
    {
        spiClockSpeed = maxSck;
    }

private:
    SPI_HandleTypeDef hspi2;
    uint32_t spiClockSpeed;
} mySpi;

extern "C" void DMA1_Channel4_IRQHandler(void);
void DMA1_Channel4_IRQHandler(void)
{
    HAL_DMA_IRQHandler(&hdma_spi2_rx);
}

extern "C" void DMA1_Channel5_IRQHandler(void);
void DMA1_Channel5_IRQHandler(void)
{
    HAL_DMA_IRQHandler(&hdma_spi2_tx);
}

It is pretty much hacked together from CubeMX code, but it is a starting point if anyone wants to build a "clean" version. Actually, the whole thing would be a lot easier, if you could get the spi handle out of the Arduino-style SPIClass, but it is private of course.
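As one possible cleanup step, the if/else chain in activate() could be reduced to a loop. This is only a sketch: pickSpiDivider is a hypothetical helper, and it returns the plain divider (2 to 256), so mapping the result to the SPI_BAUDRATEPRESCALER_x constants would still need a small lookup.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the HAL or SdFat.
// Returns the smallest power-of-two divider (2..128) for which
// pclkHz / divider does not exceed maxSckHz, and falls back to 256
// when even pclkHz / 128 is too fast, mirroring the chain above.
inline uint32_t pickSpiDivider(uint32_t pclkHz, uint32_t maxSckHz) {
  for (uint32_t divider = 2; divider < 256; divider *= 2) {
    if (maxSckHz >= pclkHz / divider) {
      return divider;
    }
  }
  return 256;
}
```

With an 80 MHz peripheral clock, a 50 MHz request selects divider 2 (40 MHz SCK) and a 1 MHz request selects divider 128 (625 kHz SCK).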

If the DMA transmission was triggered manually instead of using the HAL_SPI_xxxDMA-methods, it could be simplified more, because the IRQ callbacks could be left out and a simple transmission DMA would be enough for sending instead of using transmitReceive on a receive address without increment... This probably wouldn't work for receiving, because SD-cards are bitchy if they don't receive 0xFF when being read out and it really depends on the SPI implementation what is being sent, when only reading is triggered.

The HardFault issues are solved, so from my side this can be closed. Thanks for the work you put in!

microtronics avatar Jun 21 '21 11:06 microtronics

I have posted a custom SPI driver on SdFat-beta so the default speeds are much faster for higher end chips like STM32F4. It uses transfer(tx_buf, rx_buf, count) for send, but I still need to fill the buffer with 0XFF for receive.

Here are speeds users will see by default with no mods for a STM32F446RE.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1928.15,4710,263,264
1933.36,1330,263,264

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1824.70,282,279,280
1824.70,282,279,280

There are too many STM32 boards to specialize per board and I don't want the maintenance problem. There are three major variants of the STM32 SPI controller plus DMA variants and errata in DMA.

I must have over a hundred Arduino-like boards. Sparkfun, Teensy, and Adafruit send me boards and modules. When I wrote the Teensy SDIO driver that gets 22,000 KB/sec, Paul Stoffregen sent me 4-5 of every Teensy variant plus every add-on module: audio, enet, display, etc.

I may post an issue on the STM32 Core site for an improved wrapper. ST is still working on an H7 bug I posted months ago. They verified it and it is one of six they noted for V2.0 since SPI fails for H7 at less than 400 KHz.

greiman avatar Jun 21 '21 13:06 greiman

This probably wouldn't work for receiving, because SD-cards are bitchy if they don't receive 0xFF when being read out and it really depends on the SPI implementation what is being sent, when only reading is triggered.

SD cards are full duplex and in a multi-block transfer they look at receive for commands. That's why you can't send random junk.

I use an infinite transfer in dedicated SPI mode so I must be able to terminate the transfer by sending a command.

Modern SD cards require huge transfers for high speed. Flash pages can be as large as 128KB, so 512 byte sectors are emulated using RAM in the card. Doing a single 512 byte write and releasing chip select causes an entire flash page to be programmed.

greiman avatar Jun 21 '21 14:06 greiman