nuttx [FEATURE] The design issues regarding DMA alignment in device drivers

Is your feature request related to a problem? Please describe.

As the title says.

Describe the solution you'd like

During my recent research on STM32H7 + SDMMC, I identified several design issues in the NuttX device driver implementation. For example, the introduction of CONFIG_FAT_FORCE_INDIRECT is a flawed design decision. The current implementation logic of the fat_read function is as follows:

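fat_read(...buf...)
{
    while (!read_end)
    {
#ifndef CONFIG_FAT_FORCE_INDIRECT
        // Direct transfer: read straight into the caller's buffer
        fat_hwread(buf + offset, 1~16 sectors);
#else
        // Indirect transfer: read one sector into the FAT sector buffer, then copy
        fat_hwread(tempbuff, 1 sector);
        memcpy(buf + offset, tempbuff, 512 bytes);
#endif
        offset += ...;
    }
}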

I believe the root cause of this design stems from the chip's DMA transfer requirement, which mandates that data addresses must be N-byte aligned. However, in the fat_read function, the incoming buffer address may not meet this alignment requirement, necessitating this workaround. Otherwise, no developer would intentionally write such convoluted code.
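As a minimal sketch of the alignment requirement described above, the check reduces to a mask test on the buffer address. The helper name `dma_is_aligned` is hypothetical and is not taken from the NuttX sources; it assumes the required alignment is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when `ptr` is a multiple of `align`.
 * `align` must be a non-zero power of two (4, 8, 16, 32, 64, ...).
 */
static bool dma_is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}
```

Casting through `uintptr_t` rather than `uint32_t` keeps the check correct on targets where pointers are wider than 32 bits.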

Reviewing the code in stm32_sdmmc.c reveals that since the upper layer (MMCSD) does not handle DMA alignment, the low-level driver (stm32_sdmmc.c) must address alignment and DCache management. The current driver hierarchy is: vfs->fat->mmcsd->stm32_sdmmc.
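To illustrate why alignment and DCache management are tied together: cache maintenance works on whole cache lines, so a DMA buffer that starts or ends in the middle of a line shares that line with unrelated data. The sketch below only does the address arithmetic; the function names are hypothetical, and the 32-byte line size is an assumption that matches Cortex-M7 parts such as the STM32H7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed D-cache line size: 32 bytes on Cortex-M7. */
#define CACHE_LINE ((uintptr_t)32)

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "line size must be a power of two");

/* Round an address down to the start of its cache line. */
static uintptr_t line_floor(uintptr_t addr)
{
    return addr & ~(CACHE_LINE - 1);
}

/* Round an address up to the next cache-line boundary. */
static uintptr_t line_ceil(uintptr_t addr)
{
    return line_floor(addr + CACHE_LINE - 1);
}

/* True when [addr, addr + len) starts and ends on line boundaries, so a
 * cache invalidate over the buffer cannot touch neighbouring variables.
 */
static bool owns_whole_lines(uintptr_t addr, size_t len)
{
    return line_floor(addr) == addr && line_ceil(addr + len) == addr + len;
}
```

A 512-byte sector buffer at a 32-byte boundary passes this test; the same buffer shifted by 4 bytes does not, which is the case the driver currently has to work around.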

We need to design a new mechanism to completely solve the DMA alignment issue. For instance, in scenarios like vfs->char device->stm32 uart driver (with DMA), similar alignment problems arise. Therefore, this mechanism should be applicable to all DMA transfer scenarios.

It is critical that low-level drivers remain as simple as possible to encourage broader participation in driver development. I have observed the following development model:

Design is more critical than implementation. Developers with philosophical thinking tend to design better models, but such individuals are rare. Most developers are implementers who excel at realizing existing designs and prefer simplicity.

Take the flash driver as an example: if the upper layer handles Flash read/write alignment, the low-level driver can be significantly simplified. For example, the flash_read function could be:

flash_read(...buf...)
{
    // Operate registers to read data
}

This way, the low-level driver does not need to handle DMA alignment.

Designing a DMA Alignment Mechanism

Considering the vfs->fat->mmcsd->stm32_sdmmc hierarchy, where should a DMA Alignment Manager be placed? Let’s first consider placing it in the FAT layer:

uint32_t get_bsp_mmcsd_dma_align_bytes(void) // Layer: STM32 SDMMC BSP
{
    return ARMV7M_DCACHE_LINESIZE;
}

uint32_t get_mmcsd_dma_align_bytes(void) // Layer: Device driver
{
    return get_bsp_mmcsd_dma_align_bytes();
}

typedef struct dma_align_manager 
{
    uint8_t *original_buf;
    uint8_t *aligned_buf;
    size_t aligned_buf_size;
    size_t aligned_buf_offset;
} dma_align_manager_t;

int dma_manager_init(dma_align_manager_t *manager, uint8_t *buf, size_t size, uint32_t align_bytes) // Layer: Independent component
{
    // Initialize DMA alignment manager
    manager->original_buf = buf;
    manager->aligned_buf = (uint8_t *)align_malloc(size, align_bytes); // align_bytes: e.g., 4, 8, 16, 32, 64
    if (!manager->aligned_buf)
    {
        // Allocation failed
        return -1;
    }
    manager->aligned_buf_size = size; // Remember the capacity of the aligned buffer
    manager->aligned_buf_offset = 0;
    return 0;
}

ssize_t fat_read(...uint8_t *buf...) // Layer: FAT
{
    uint32_t align_bytes = get_mmcsd_dma_align_bytes();
    if ((uintptr_t)buf % align_bytes == 0) // Buffer is aligned
    {
        // Direct read: no alignment needed
        while (!read_end)
        {
            fat_hwread(buf + offset, 1~16 sectors);
            offset += ...;
        }
    }
    else // Buffer is unaligned
    {
        dma_align_manager_t dma_mgr;
        if (dma_manager_init(&dma_mgr, buf, max_sectors * 512, align_bytes) != 0)
        {
            return -ENOMEM;
        }
        while (!read_end)
        {
            fat_hwread(dma_mgr.aligned_buf + dma_mgr.aligned_buf_offset, 1~16 sectors);
            // Update offsets and copy data from aligned buffer to original buffer as needed
            dma_mgr.aligned_buf_offset += sectors * 512;
        }
        dma_manager_finalize(&dma_mgr); // Free aligned buffer
    }
    return total_bytes_read;
}
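
The single-allocation idea can be sketched as a small host-side C example. This is only an illustration under stated assumptions, not the NuttX implementation: `fake_hwread` and `bounce_read` are hypothetical names, the device is simulated in memory (sector n is filled with the byte value n), and the C11 function `aligned_alloc` stands in for the `align_malloc` used above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define DMA_ALIGN   32u   /* assumed DMA alignment requirement */

/* Hypothetical stand-in for the hardware read: sector n is filled with n. */
static int fake_hwread(uint8_t *dst, uint32_t first, uint32_t count)
{
    /* The simulated DMA engine refuses misaligned targets. */
    assert(((uintptr_t)dst & (DMA_ALIGN - 1)) == 0);
    for (uint32_t i = 0; i < count; i++)
    {
        memset(dst + (size_t)i * SECTOR_SIZE, (int)(first + i), SECTOR_SIZE);
    }
    return 0;
}

/* Read `count` sectors into `buf`, which may have any alignment.
 * `chunk` is the bounce buffer size in sectors and must be non-zero.
 */
static int bounce_read(uint8_t *buf, uint32_t first, uint32_t count,
                       uint32_t chunk)
{
    assert(chunk > 0);
    if (((uintptr_t)buf & (DMA_ALIGN - 1)) == 0)
    {
        return fake_hwread(buf, first, count);  /* aligned: no copy needed */
    }

    /* One allocation serves every hardware read in the loop below. */
    uint8_t *bounce = aligned_alloc(DMA_ALIGN, (size_t)chunk * SECTOR_SIZE);
    if (bounce == NULL)
    {
        return -1;
    }

    int ret = 0;
    for (uint32_t done = 0; done < count && ret == 0; )
    {
        uint32_t n = (count - done < chunk) ? count - done : chunk;
        ret = fake_hwread(bounce, first + done, n);
        if (ret == 0)
        {
            memcpy(buf + (size_t)done * SECTOR_SIZE, bounce,
                   (size_t)n * SECTOR_SIZE);
            done += n;
        }
    }

    free(bounce);
    return ret;
}
```

Reusing one chunk-sized buffer keeps the allocation independent of the total transfer length, at the cost of one `memcpy` per chunk.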

This approach automatically handles alignment and requires only one aligned memory allocation for multiple fat_hwread calls.

Alternative: Placing the Manager in the MMCSD Layer

Now consider moving the DMA alignment logic to the MMCSD layer:

ssize_t fat_read(...uint8_t *buf...) // Layer: FAT
{
    while (!read_end)
    {
        fat_hwread(buf + offset, 1~16 sectors);
        offset += ...;
    }
}

static ssize_t fat_hwread(uint8_t *buf, uint32_t sectors) // Layer: FAT
{
    return mmcsd_read(buf, sectors);
}

ssize_t mmcsd_read(uint8_t *buf, uint32_t sectors) // Layer: MMCSD
{
    uint32_t align_bytes = get_bsp_mmcsd_dma_align_bytes();
    bool use_dma = get_bsp_mmcsd_use_dma();

    if ((uintptr_t)buf % align_bytes == 0) // Aligned buffer
    {
        if (use_dma)
        {
            return bsp_sdmmc_dma_read(buf, sectors); // Direct DMA transfer
        }
        else
        {
            return bsp_sdmmc_polled_read(buf, sectors); // Polled read
        }
    }
    else // Unaligned buffer
    {
        dma_align_manager_t dma_mgr;
        if (dma_manager_init(&dma_mgr, buf, sectors * 512, align_bytes) != 0)
        {
            return -ENOMEM;
        }
        
        ssize_t ret = 0;
        if (use_dma)
        {
            ret = bsp_sdmmc_dma_read(dma_mgr.aligned_buf, sectors); // DMA to aligned buffer
        }
        else
        {
            ret = bsp_sdmmc_polled_read(dma_mgr.aligned_buf, sectors); // Polled read to aligned buffer
        }
        
        // Copy data from the aligned buffer back to the caller's buffer on success
        if (ret >= 0)
        {
            memcpy(buf, dma_mgr.aligned_buf, sectors * 512);
        }
        
        dma_manager_finalize(&dma_mgr);
        return ret;
    }
}

While this requires a memory allocation for each read, it is more appropriate for the device driver layer (MMCSD) to handle DMA alignment than for the FAT layer. Device drivers should be concerned with DMA; FAT should not.

General Applicability

For other scenarios such as vfs->char device->stm32 uart driver (with DMA), the DMA Alignment Manager should be integrated into the char device layer (e.g., the UART driver).

Key Takeaways

Separation of Concerns: Hardware-specific alignment logic belongs in the device driver layer (e.g., MMCSD, UART), not in upper layers (FAT, VFS).

Simpler Low-Level Drivers: By abstracting alignment via a reusable manager, low-level drivers become easier to develop and maintain, welcoming more contributors.

Reusability: The DMA Alignment Manager can be generalized across all DMA-enabled drivers, reducing code duplication and improving consistency.

Thank you for reviewing this technical analysis! Let me know if further refinements are needed. My English is not very good, and the above content is translated by AI.

Describe alternatives you've considered

No response

Verification

[x] I have verified before submitting the report.

May 20 '25 12:05 snikeguo

The way NuttX currently works, the MMC/SDIO driver has to accept unaligned reads. This is solved in the driver itself, which is fine I think.

This proposal you're making however introduces more clutter I think, because each subsystem using mmc/sdio has to use your proposed dma_manager.

Furthermore this solution uses align_malloc, which is kinda unacceptable for an rtos, because it would break determinism. Currently the driver solution uses a pre-allocated buffer to fix alignment.
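
For comparison, the pre-allocated approach mentioned here can be sketched as follows. This is a host-side illustration under assumptions, not the code in stm32_sdmmc.c: `g_bounce`, `fake_dma_read`, and `driver_read` are hypothetical names, and the device is simulated so that sector n is filled with the byte value n.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define DMA_ALIGN   32u   /* assumed DMA alignment requirement */

/* One sector-sized bounce buffer reserved at build time: no heap use, so
 * the worst-case cost of an unaligned read is known in advance.  A real
 * driver would also need a lock around this shared buffer.
 */
static alignas(DMA_ALIGN) uint8_t g_bounce[SECTOR_SIZE];

/* Hypothetical stand-in for a one-sector DMA read. */
static int fake_dma_read(uint8_t *dst, uint32_t sector)
{
    assert(((uintptr_t)dst & (DMA_ALIGN - 1)) == 0);
    memset(dst, (int)sector, SECTOR_SIZE);
    return 0;
}

/* Read `count` sectors into `buf`, which may have any alignment. */
static int driver_read(uint8_t *buf, uint32_t first, uint32_t count)
{
    int aligned = ((uintptr_t)buf & (DMA_ALIGN - 1)) == 0;

    for (uint32_t i = 0; i < count; i++)
    {
        uint8_t *dst = buf + (size_t)i * SECTOR_SIZE;

        /* Aligned callers get a direct transfer; others go via g_bounce. */
        int ret = fake_dma_read(aligned ? dst : g_bounce, first + i);
        if (ret != 0)
        {
            return ret;
        }
        if (!aligned)
        {
            memcpy(dst, g_bounce, SECTOR_SIZE);
        }
    }
    return 0;
}
```

The trade-off against a heap-allocated buffer is a fixed RAM cost that is paid even when every caller passes aligned buffers, in exchange for a read path that cannot fail with an out-of-memory error.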

May 25 '25 07:05 PetervdPerk-NXP

@PetervdPerk-NXP Let me take an example with the MPC5606 Flash Driver binary provided by NXP. Many years ago, when I was using MPC56XX series chips, the Flash Driver interface provided by NXP was aligned to chip pages or sectors. I mentioned to NXP's FAE that my APP only requires 4-byte alignment at maximum, criticizing that their Driver was not user-friendly enough and failed to consider engineers' actual needs.

It was not until I worked on the RH850's Flash Align Manager that I realized how problematic it is to handle alignment at the lowest level. To enable arbitrary byte-level writes from upper layers, I had to invoke interfaces like Malloc and mutex.

Have you ever considered this: A driver framework should be generic code that exposes sufficient functionality to the application layer while providing the simplest possible interfaces to the chip boot layer. Not everyone can design high-quality driver frameworks, but if the chip driver layer is kept simple enough, it will allow more people to contribute to chip driver development. There are countless chip drivers in the world that need to be implemented – only a few can produce high-quality ones. Therefore, we should lower the barrier for developing chip drivers. Everyone can do 1+1=2, but only a few can handle calculus.

Jun 01 '25 07:06 snikeguo

I believe it ultimately comes down to clearly documenting the driver layers, so that everyone has a shared understanding of what the upper driver layer needs.

> Have you ever considered this: A driver framework should be generic code that exposes sufficient functionality to the application layer while providing the simplest possible interfaces to the chip boot layer. Not everyone can design high-quality driver frameworks, but if the chip driver layer is kept simple enough, it will allow more people to contribute to chip driver development. There are countless chip drivers in the world that need to be implemented – only a few can produce high-quality ones. Therefore, we should lower the barrier for developing chip drivers. Everyone can do 1+1=2, but only a few can handle calculus.

Your proposal covers all DMA-related subsystems, which I believe is out of scope. Please keep to the "simplest possible interfaces" and just focus on MMC/SDIO.

Furthermore, your proposed simplicity comes with a big cost, which is the use of dynamic memory allocation. I think that should be avoided on an RTOS for resource-constrained devices.

Jun 01 '25 07:06 PetervdPerk-NXP

> Furthermore, your proposed simplicity comes with a big cost, which is the use of dynamic memory allocation. I think that should be avoided on an RTOS for resource-constrained devices.

Actually, it won't. You can take a look at my PR code. If CONFIG_ARCH_HAVE_SDIO_PREFLIGHT is not defined, the buffer provided by the user will be used directly. If CONFIG_ARCH_HAVE_SDIO_PREFLIGHT is defined, the address will be checked first. If the address is aligned, the data will be transferred directly. If not, an aligned buffer will be allocated.

read(buf, buflen)
{
#ifdef CONFIG_ARCH_HAVE_SDIO_PREFLIGHT
    if (buf address is not aligned)
    {
        new align manager(buflen)
        read sector(aligned buffer)
    }
    else
    {
        read sector(buf)
    }
#else
    read sector(buf)
#endif
}

If we don't use dynamic memory and POSIX, why are we using NuttX? Wouldn't it be better to use FreeRTOS or uC/OS? Why would I choose a relatively heavyweight framework like NuttX?

You can't have your cake and eat it too. Take security and convenience as an example—they've always been a contradiction.

I believe our goal should be to provide the application layer with as many convenient interfaces as possible, while ensuring that the interfaces exposed to the specific chip driver layer are as simple as possible to implement. The remaining complex tasks should be handled by the NUTTX device driver layer, which requires individuals with philosophical thinking to tackle.

Jun 01 '25 08:06 snikeguo

> If we don't use dynamic memory and POSIX, why are we using NuttX? Wouldn't it be better to use FreeRTOS or uC/OS? Why would I choose a relatively heavyweight framework like NuttX?

Because NuttX isn't Linux, NuttX can even run on a Z80.

But lately there are a lot of changes/developers trying to turn NuttX into Linux/BSD, which is counterproductive. In its essence NuttX is an RTOS; anywhere we can avoid a malloc and gain some determinism is always a win, because it makes the system simpler and more predictable.

> while ensuring that the interfaces exposed to the specific chip driver layer are as simple as possible to implement. The remaining complex tasks should be handled by the NUTTX device driver layer, which requires individuals with philosophical thinking to tackle.

Yeah, at this point it seems like we’ve got pretty different views on this especially around making things easier for driver developers vs. keeping the OS side maintainable. I don’t think this is heading toward anything super productive, so I’m gonna step back from this issue. Cheers.

Jun 01 '25 08:06 PetervdPerk-NXP

I have a lot of complaints. First of all, NUTTX is, after all, an open source project, not a project backed by a financially powerful company like Apple behind LLVM. So I really don't understand why NUTTX doesn't use the drivers provided by various manufacturers, but instead insists on re-implementing everything from scratch. This is extremely disadvantageous for the NUTTX project. We all hope NUTTX can become as famous as Linux. To do that, we can use some opportunistic approaches to quickly support a wide range of chips and shift chip-specific bugs to the chip manufacturers. For example, we could use STM32's LL library, etc. Just look at Zephyr: they are doing exactly this. Let’s solve the problem of having enough to eat first, and then think about eating meat.

> Because NuttX isn't Linux, NuttX can even run on a Z80.

As for your point, I also can't understand it. Why not just gather some statistics on the market share of Z80 running NUTTX/Zephyr? The Z80 is comparable to the STC8051. If I were a boss and my R&D engineer told me they wanted to use Z80+NUTTX or STC8051+NUTTX, I would definitely think there is something wrong with their thinking. Does this combination make any sense? If it's just to show off one's technical skills, then just port NUTTX to Z80 by yourself.

We need to focus on reality—use simple solutions for simple needs and complex solutions for complex needs. Otherwise, it’s pointless.

Jun 01 '25 09:06 snikeguo

> I have a lot of complaints. First of all, NUTTX is, after all, an open source project, not a project backed by a financially powerful company like Apple behind LLVM. So I really don't understand why NUTTX doesn't use the drivers provided by various manufacturers, but instead insists on re-implementing everything from scratch. This is extremely disadvantageous for the NUTTX project. We all hope NUTTX can become as famous as Linux. To do that, we can use some opportunistic approaches to quickly support a wide range of chips and shift chip-specific bugs to the chip manufacturers. For example, we could use STM32's LL library, etc. Just look at Zephyr: they are doing exactly this. Let’s solve the problem of having enough to eat first, and then think about eating meat.

This is just your personal opinion. Some people are in this project precisely because NuttX IS NOT like Zephyr.

I think you should familiarize yourself with https://github.com/apache/nuttx/blob/master/INVIOLABLES.md

Jun 01 '25 09:06 raiden00pl

@raiden00pl You are idealists, and although I don’t agree with your design, I am more of a pragmatist. The reason I am sharing these thoughts is for the benefit of those who will learn NuttX in the future—for example, people who want to add driver support for a new chip. When they feel confused, maybe they can find this conversation by searching for keywords, and it might offer them some comfort. That’s all.

Jun 01 '25 09:06 snikeguo