esp-idf icon indicating copy to clipboard operation
esp-idf copied to clipboard

Persist DMA buffer in sdmmc_read_sectors/sdmmc_write_sectors (IDFGH-12770)

Open ducalex opened this issue 1 year ago • 7 comments
trafficstars

Is your feature request related to a problem?

When internal memory runs out, SD card reads and writes start returning ESP_ERR_NO_MEM because this line fails:

void* tmp_buf = heap_caps_malloc(block_size, MALLOC_CAP_DMA);

In theory SPIRAM_MALLOC_RESERVE_INTERNAL should ensure that memory is always available but in practice it doesn't always work.

There are likely also performance implications to allocating/deallocating all the time (in some cases).

Having no DMA-capable memory left is obviously an undesirable situation in general, but it can happen and when it does we might still need SD card access.

Describe the solution you'd like.

My understanding is that block_size is at most 512 bytes: out_csd->sector_size = MIN(read_bl_size, 512);.

Therefore there is no reason to not keep the buffer around after it's first allocated for future use. In fact with that size it could be allocated preemptively when the SD card is initialized, or even placed in static BSS. Because if it's needed once, it will very likely be needed again!

Describe alternatives you've considered.

My current solution is this:

  1. Reserve a modest DMA-capable memory block on start
  2. Wrap sdmmc_host_t's do_transaction() to detect errors
  3. If an error occurs, free the reserved block and try the transaction again
  4. If the transaction succeeds, try to grab the block back

But this is far from ideal and doesn't always work...

Additional context.

#6596 suggested the same but was closed because of the emphasis on performance. I think the emphasis should be on stability improvements if this change is introduced.

ducalex avatar May 05 '24 21:05 ducalex

My understanding is that block_size is at most 512 bytes

Hmm I think that's not true? There are quite a lot of those big SD cards that comes with 1024 or even 4096 bytes sector.

Even some SD NAND chips for industrial purposes uses 1024 bytes sector. I've definitely seen a 16GB one doing that but I can't remember the part number

huming2207 avatar May 06 '24 02:05 huming2207

Hi @ducalex, From what you have described so far, it looks like the rest of application is using DMA capable memory space so heavily, that there isn't even 512B block left at certain moment, when it comes to SD card access. I think keeping the buffer for SD allocated statically aka "forever" would just expose another place in the code, where the DMA capable memory won't be available when needed. Could you check the rest of your code for source of potential leak / fragmentation of DMA capable RAM ?

rrtandler avatar May 06 '24 14:05 rrtandler

Hmm I think that's not true? There are quite a lot of those big SD cards that comes with 1024 or even 4096 bytes sector.

That is what I thought at first, but when I looked up the actual code in esp-idf, the only assignment I've found was the line I've quoted. Which would result in 512 or less.

Perhaps I missed another code path somewhere?

ducalex avatar May 06 '24 14:05 ducalex

That is what I thought at first, but when I looked up the actual code in esp-idf, the only assignment I've found was the line I've quoted. Which would result in 512 or less.

Ah crap, yea I look down the code and indeed I can find them here:

  • https://github.com/espressif/esp-idf/blob/d4cd437ede613fffacc06ac6d6c93a083829022f/components/sdmmc/sdmmc_sd.c#L388
  • https://github.com/espressif/esp-idf/blob/d4cd437ede613fffacc06ac6d6c93a083829022f/components/sdmmc/sdmmc_mmc.c#L202

I think this might be a bug?? Maybe it supposed to be MAX() instead? Otherwise any newer/larger SD cards won't be read/write properly.

Meanwhile, I believe these sort of DMA buffers shouldn't be statically allocated, because we do have use cases where two or more SD cards are needed (that shares with the same SPI but different CS). We have PSRAM here so we have sufficient RAM for heap allocations anyway.

huming2207 avatar May 09 '24 00:05 huming2207

I agree that it shouldn't be static, it was hyperbole on my part! I also agree that running out of DMA-capable memory indicates bigger problems.

However I maintain that it would be beneficial to keep the buffer after it's been used for the first time, because it will undoubtedly be needed again so that space is already pseudo-reserved. Making it explicit is simply more efficient and improves resilience.

That being said, my proposal hinges on the fact that the buffer is small, 1KB or less. If there is a bug and the sector size can in fact be 4KB or even more then yes, it is obviously unreasonable to keep that around!

ducalex avatar May 09 '24 01:05 ducalex

However I maintain that it would be beneficial to keep the buffer after it's been used for the first time, because it will undoubtedly be needed again so that space is already pseudo-reserved. Making it explicit is simply more efficient and improves resilience.

Hmm yep, we do need to allocate at initialisation and free on deinit, not on every write op.

I will submit a pull request first to fix the MIN() issue, that will definitely become a big troublemaker on our firmware. Then I will see if I can work out a way to fix this heap allocation.

huming2207 avatar May 09 '24 04:05 huming2207

Thanks for the idea @ducalex! I agree that avoiding repeated allocations/deallocations could help with reliability.

Recently we have added a new dma_aligned_buffer member to sdmmc_host_t, which has a similar purpose: it provides a DMA-capable buffer which can be reused between operations. It's kind of misplaced in sdmmc_host_t, actually, as the usage occurs in the sdmmc component (protocol layer), not in the host driver, but that's something we have to refactor... I think we could reuse this dma_aligned_buffer for read/write commands as well.

igrr avatar May 10 '24 13:05 igrr

I am seeing the same (or very similar issue to the above) I am frequently getting errors when attempting to stat() files on the SD Card E (13:58:20.351) diskio_sdmmc: sdmmc_read_blocks failed (257)

as the OP describes this is ESP_ERR_NO_MEM

I have a esp32-s3 and 8MB of SPIRAM, I am also setting the option CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=32768

From the application layer I cannot see how to resolve this, I have file operations which appear in bursts - I wish I understood the underlying cause of what is happening.

This is making the disk unusable as I cannot guarantee reads/writes

eroom1966 avatar Oct 03 '24 14:10 eroom1966

Further reading the call that is made is as follows heap_caps_malloc(512, MALLOC_CAP_DMA)

From the manual Use the MALLOC_CAP_DMA flag to allocate memory which is suitable for use with hardware DMA engines (for example SPI and I2S). This capability flag excludes any external PSRAM.

This means that the setting of CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL is irrelevent as it will not use the PSRAM anyway - is this correct ?

eroom1966 avatar Oct 03 '24 14:10 eroom1966

Hi @eroom1966,

This means that the setting of CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL is irrelevent as it will not use the PSRAM anyway - is this correct ?

The purpose of CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL is to set aside some part of internal RAM and only return it for allocations which explicitly ask for MALLOC_CAP_DMA-capable memory. It doesn't guarantee that DMA buffers will always be available, but generally helps by pushing some of the non-DMA allocations into PSRAM.

In addition to increasing this value, I would also recommend decreasing CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL, so that more of the smaller allocations are done in PSRAM by default.

To better understand the usage of internal and external RAM when this issue occurs I would recommend using the allocation failure hooks. You can define such a hook and use functions such as heap_caps_dump and heap_caps_dump_all to see utilization of various heaps at the point when the allocation fails.

igrr avatar Oct 03 '24 15:10 igrr

Investigating this a little further, it appears that when the call to heap_caps_malloc(512, MALLOC_CAP_DMA) fails to return a valid pointer, I still have a lot of heap available returned from heap_caps_get_free_size(MALLOC_CAP_DMA); in some cases 6kb, in which case the problem must be due to fragmentation

I tried a simple test to do the following in the functions sdmmc_read_sectors/sdmmc_write_sectors allocate a buffer (512) and return to a static pointer

    static void *static_buf = NULL;
    if (static_buf == NULL) {
        static_buf = heap_caps_malloc(512, MALLOC_CAP_DMA);
    }

I then never free this buffer, so it is permanently available, and not affected by any fragmentation. I never see any issues with errors of the form E (xxx) diskio_sdmmc: sdmmc_read_blocks failed (257) E (xxx) diskio_sdmmc: sdmmc_write_blocks failed (257) All of my file operations complete without IO errors

why is my heap usage so bad. All of my user tasks are using the PSRAM via calls to malloc(), so this must be due to tasks not under my control

this needs further investigation

eroom1966 avatar Oct 03 '24 15:10 eroom1966

I would suggest trying the heap debugging functions mentioned above — heap_caps_dump or heap_caps_dump_all — to see what is the maximum continuous block available for allocation in the internal RAM. If you see that this block is decreasing throughout the program execution, it would indeed point to fragmentation.

igrr avatar Oct 03 '24 19:10 igrr

Thanks @igrr I will take a look at this The difficulty I have is that all of my application code is using malloc() so that I only ever use heap from the external memory However, there are tasks I have no control which are effectively -

  • Wi-Fi APSTA
  • Sd card (spi)
  • A small number of user tasks that MUST place the stack on the internal memory heap due to restrictions imposed by specific API calls

so I am not sure what my approach would be if the issues are identified in these areas

more info to follow …

eroom1966 avatar Oct 04 '24 06:10 eroom1966

@igrr I have some results. The output provides the heap starting location, I am not sure how to correlate that to the block which is DMA_CAPABLE ? The interesting data is attached as a text file. Also, is there a way to know which task/function each of these allocations is attributable ? This information is useful, but I don't think it helps identify the underlying problem or a solution, I will attach my sdkconfig as well

thanks for your help ! sdkconfig.txt heap.txt

eroom1966 avatar Oct 04 '24 09:10 eroom1966

further update decreasing the value in CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL has made a massive impact - this appears to be the root cause. I now have this set to 512 and I am no longer seeing the internal heap reducing to nothing

eroom1966 avatar Oct 04 '24 15:10 eroom1966