nrf-hal icon indicating copy to clipboard operation
nrf-hal copied to clipboard

EasyDMA doesn't read from Flash

Open jamesmunns opened this issue 5 years ago • 23 comments

In general, EasyDMA does not work with items in Flash, e.g. a &'static str. From the datasheet:

EasyDMA is an easy-to-use direct memory acc s module that some peripherals implement to gain direct access to Data RAM. The EasyDMA is an AHB bus master similar to the CPU and it is connected to the AHB multilayer interconnect for direct access to the Data RAM. The EasyDMA is not able to access the Flash.

I did this accidentally, and did not receive any visible error when sending via UARTE. I think there would be 3 ways to address this:

  1. if there is some way to detect this error condition and return an Err
  2. debug_assert that the slice ptr is in the RAM region
  3. Do an explicit check, roughly if !( RAM_LOWER <= buf.as_ptr() <= RAM_UPPER ) { return Err(...); }

jamesmunns avatar Dec 24 '18 21:12 jamesmunns

Experienced the same just today. I am not sure what we do best (Error/debug_assert!). Since we already go by errors I think we can keep them even tho debug_assert!s would be better also for buffer length checking etc. I think.

Yatekii avatar Jan 26 '19 03:01 Yatekii

4th option: Copy to ram

When implementing embedded_hal traits where one do not have control over the external calling code it could be necessary to copy from flash to ram and then EasyDMA (which of course feels wrong)

I needed to solve this problem so I created some rough implementation as prof of concept:

https://github.com/simonsso/nrf52-hal/commit/cee4745e399d50b1122b0b32e1baefa486197df0

(It needs some clean up before I can turn it into a pull request)

simonsso avatar Jan 28 '19 06:01 simonsso

@simonsso true, copying to RAM is always an option, however I'd like to avoid excess copies when not needed. I think the approach you're taking in https://github.com/simonsso/nrf52-hal/commit/cee4745e399d50b1122b0b32e1baefa486197df0 is definitely interesting!

jamesmunns avatar Jan 29 '19 02:01 jamesmunns

I like the fact that your implementation throws an error when the do_spi_dma_transfer function detects an out of RAM bounds.

What I do not really like is the implicit copy. It's nice if you just expect the SPI bytes to be pumped out. But for some cases (for example SPI abuse for driving LEDs) this can kill you. At least on the nRF52832 there was a very tiny time-gap between SPI transactions, which really can kill you in terms of timings (we had some major issues, even when directly chaining the transactions via the PPI on-chip bus).

So I would rather have it such that we have a separate method to assemble that buffer which you can use in case of an erroneous transaction due to RAM boundaries. You could even chain that easily via results like this:

spi.write(&buf).or_else(|e| if let Error::DMABufferNotInDataMemory = e { spi.write(&copy_to_ram(&buf)) });

If you really always want to copy if things go south.

Maybe you all see that differently, but I would prefer such a variant.

Yatekii avatar Feb 06 '19 09:02 Yatekii

Please elaborate, I don't understand your concern. If you are using this trait and have your buffer in RAM there will be no copying. The check in do_spi_dma_transfer is a double check when using the traits where buffer copy have already been done - it is needed for the traits which do not yet implement this feature.

simonsso avatar Feb 06 '19 22:02 simonsso

No, actually the trait will do the copying for you: https://github.com/nrf-rs/nrf52-hal/blob/master/nrf52-hal-common/src/spim.rs#L95 . It is 100% copying ;) This happens implicitly inside the write function, which I highly dislike. The function doesn't tell you it did that copy. This can be bad for various reasons :)

Yatekii avatar Feb 06 '19 23:02 Yatekii

Yes at but at Line 84 there is the same check - The first part is the no copy version and the second part the copy version

https://github.com/nrf-rs/nrf52-hal/blob/e438a5a2a674b21e8f5162eb5da49330d0a55771/nrf52-hal-common/src/spim.rs#L84

-- at least that is what the code is intended to do

It must be done silently inside this write function otherwise this trait will not work for code using embedded_hal having the buffer in flash -- to rewrite all such code is not posible

simonsso avatar Feb 06 '19 23:02 simonsso

Yes I know it doesn't work from flash. And no, it doesn't have to happen silently. you can easily have it to be requiring a conscious copy before the transfer and otherwise it'll error.

Yatekii avatar Feb 07 '19 07:02 Yatekii

@Yatekii How do you mean we could we solve this in another way for a embedded_hal trait user? -- the same code must work on several platforms?

If the buffer is required to be mutable it will end up in accessible memory without a copy - but this would not implement the traits as specified, or is there a way I am not aware of?

simonsso avatar Feb 07 '19 09:02 simonsso

Maybe to help you unstand why I don't like the implicit copy inside the ::write() function here are several downsides that come with it:

  • You automatically batch the transfer into several smaller transfers of constant size (255 bytes at the moment). This does not do the same if you transferred one batch, timing wise. There will be a small gap between the transferred batches. This can kill you sometimes. This might not be an issue on the 832 version as you se the same temporary buffer size as the easy dma max transfer size. But on the 840 version it is.
  • The copy can be a significant overhead that will occur with every transaction. Imagine you transfer static data in periodic manner (for whatever reason) then you might rather copy the data just once at the start and not with every transaction.
  • I think there was one more argument I forgot atm. Will edit if it comes to my mind.

Above I already gave you an example how this could be done explicitely:

spi.write(&buf).or_else(|e| if let Error::DMABufferNotInDataMemory = e { spi.write(&copy_to_ram(&buf)) });

This is NOT a problem that the trait has/must fix. All the others don't have to copy either. Actually, I would argue that it's absolutely not in the traits spirit to copy implicitely.

So as shown in the code sample, I would simply separate the copy code (also the batching code for above reason) form the ::write() function into another function, possibly something like [u8; SIZE] copy_to_ram(&[u8]).

Yatekii avatar Feb 07 '19 09:02 Yatekii

In general, I think there are two opposing thoughts here:

  1. Do we want things to "just work", in a perhaps suboptimal way?
  2. Do we want things to refuse to work in a suboptimal way?

In general, IMO the users of the embedded-hal traits cannot expect that everything will work optimally, as optimal generally requires knowledge of the platform. Our interface via the embedded hal traits is also constrained, it is difficult to provide all options through a single standard interface.

In terms of performance impact, at 64MHz, copying 255 bytes takes about 64 clock cycles, plus a little bit more to retrigger the transaction. Let's call it 128 clock cycles. This is roughly the amount of time taken to send 1 bit at 500khz, or 1 byte at 4MHz. This is certainly nonzero, though likely not significant for most use cases.

@Yatekii do you think people with high performance needs are likely to use the embedded hal traits, rather than either: using the nrf crate directly, or creating a wrapper type that implements the behavior they need? I feel like most people will investigate the actual source code when their hardware doesn't work, and if they realize that they should pre-buffer in RAM, they will remove the unexpected overhead. I also feel like most people will never notice this, or more likely notice the RAM used for buffering first.

jamesmunns avatar Feb 07 '19 13:02 jamesmunns

Sure, it's hard to build a working has-it-all interface. I don't expect that to be a thing. But when I give a buffer to a write() function, I actually expect it to transmit every byte in exactly the same fashion.

I agree that the performance overhead should be negligible, still if it can be avoided, I like that option.

The trait is intentionally designed in such a way, that it can return a custom Error, so why not use that feat here? I really don't think it is asked too much to copy manually and just return an error if someone forgets that. I mean one of the cooler features are built in Result types that you HAVE to check to avoid a warning. Whenever I have to do that, I actually look in the docs what kind of errors could happen, which will actually tell me there is a caveat. If you just use the ::write() function as if there is no caveat, the general user will never know there is one. Especially since there is so many differences in the different chips. I like steering through meaningful errors more than hoping the user (which could potentially be myself) is not lazy and reads all the manuals :)

The problem is that investigating a seemingly trivial problem always takes the most time in my experience. I am not so much concerned about performance for the general case. But there is some cases (I had some of those) where it mattered. And writing your own interface even tho there is two unfavorable ones already seems kinda silly to me if we can prevent his simply by explicitly calling a second function :)

One real good example where I would mind all of this: I wanted to build my Neopixel driver on the embedded-hal traits so any chip with an UART/SPI can use them. This would most likely not possible with this implementation :) It's not exactly fair to field this example here tho, as my driver needs non blocking DMA interface to enable the high throughput whilst still being able to calculate animations, but it's a real usecase I know of.

I think both approaches are totally valid and I wont be pissed if either one gets chosen, but my vote goes to the explicit one.

Yatekii avatar Feb 07 '19 14:02 Yatekii

@Yatekii You make a very good point. I'm undecided right now, but you've almost got me convinced.

@jamesmunns

In general, I think there are two opposing thoughts here:

  1. Do we want things to "just work", in a perhaps suboptimal way?
  2. Do we want things to refuse to work in a suboptimal way?

I think an argument can be made that a HAL, as the lowest-level safe API, shouldn't hide these things from the user. Instead it should make any potential problem explicit, and leave the "just works" to higher-level layers.

do you think people with high performance needs are likely to use the embedded hal traits, rather than either: using the nrf crate directly, or creating a wrapper type that implements the behavior they need?

I think that depends on whether the people with high performance needs are writing a library or an application. I think it's reasonable to assume that a driver with very precise timing needs would be written using embedded-hal. I also wouldn't be surprised if some poor soul would be driven half-mad by weird timing issues, before thinking to look at the right code and discovering the implicit copy.

So yeah, I'm still undecided, but definitely leaning in @Yatekii's direction.

hannobraun avatar Feb 07 '19 16:02 hannobraun

Right now in SPI driver there are both approaches the low level write function - (the one which takes a pin as argument) - It will fail with the error if buffer is not in RAM and the embedded_hal trait which copies the buffer.

For me the case is very clear - when working with code from other sources implementing the same interfaces then it MUST work for the same for all platforms and all drivers. If in every driver I bring in to my project I must check not only what traits they need but also if they handle custom errors the intended way or not incompatibility will very quickly be an unmanageable problem.

I was intending to implement the copy buffer for all functions but as I have been putting my arguments down in writing in this thread I see they are only important for the external traits in for instance embedded_hal. - @hannobraun I think embedded_hal should be this "just works"-higher level layer.

@Yatekii wrote:

* You automatically batch the transfer into several smaller transfers of constant size (255 bytes at the moment). This does not do the same if you transferred one batch, timing wise. There will be a small gap between the transferred batches. This can kill you sometimes. This might not be an issue on the 832 version as you se the same temporary buffer size as the easy dma max transfer size. But on the 840 version it is.

* The copy can be a significant overhead that will occur with every transaction. Imagine you transfer static data in periodic manner (for whatever reason) then you might rather copy the data just once at the start and not with every transaction.

Correct only if the buffer is readonly in flash - this is not done if the caller provides a buffer in RAM then there is no copy and the buffer for 840 will be split only when it reaches the 16 bit limit.

simonsso avatar Feb 08 '19 00:02 simonsso

I am aware of all of this :) I never talked about any other case than the one where flash contents are copied to RAM.

You mention that all HAL implementations should behave the same way. I agree completely, but the current implementation with the hidden copy fails to do exactly that. It silently adds an overhead and potentially different timing behavior when it decides it's necessary without ever notifying the user.

Having a custom error was intended by the writers of the trait. I think it should be used. If we want perfect simplicity with random implicit behavior, we chose the wrong language I think. Also, if you write a lib that can potentially fail exactly that write, you can simply wrap the generic error type in your own error type and hand that out. And the final user of your lib gets perfect error handling. I don't even think there is a simplicity problem :)

What I mean is this:

pub enum LedError {
    SPIError(SPI::Error)
}

pub fn can_fail(spi: &mut SPI, buf: &[u8]) -> Result<(), LedError> {
    // do custom stuff
    spi.write(&buf).or_else(|e| LedError::SPIError(e))
}

That's a very nice way to handle things and not make them complicated.

Yatekii avatar Feb 08 '19 08:02 Yatekii

I still agree with @Yatekii. I think the fact of the matter is, we can't make it behave the same on all platforms, because the platforms have different capabilities. The implicit copy does not make it behave the same, because on the level we're working on, timing matters.

So the question simply becomes, which problem are we going to cause our users?

  • Option A: Timing issues, due to the implicit copy. Most won't notice, but some will have a very hard time dealing with this.
  • Option B: An explicit error. Many users who see this would have been better served by the implicit copy, but it's going to be a straight-forward problem to fix.

My vote is for option B, the explicit error. In my opinion, driving a few people crazy to shield a larger group from straight-forward problems is a bad trade-off.

hannobraun avatar Feb 08 '19 12:02 hannobraun

I prefer Option A.

I write use some code which I want to run on several platforms and it also uses third party drivers for some hardware components. Those drivers use embedded_hal traits. I cannot put hardware dependent code where the trait is called.

The spi.write(&buf).or_else(|e| if let Error::DMABufferNotInDataMemory = e { spi.write(&copy_to_ram(&buf)) }); would have to go into the calling code and that is done in the 3rd party driver.

With option B, the 3rd party driver will receive an Error when writing some of its read-only data to SPI but this will be detected in a place where it cannot be handled. I don't see how I could get those drivers working without allowing write with buffer in flash.

simonsso avatar Feb 08 '19 14:02 simonsso

@simonsso

The spi.write(&buf).or_else(|e| if let Error::DMABufferNotInDataMemory = e { spi.write(&copy_to_ram(&buf)) }); would have to go into the calling code and that is done in the 3rd party driver.

I agree that this isn't the solution. A driver would need to pass that error to its back calling code. The solution would then be to copy the data to RAM in the first place, wherever that is appropriate. This could be in the driver code or in the calling code.

A driver that receives a slice from its calling code could also decide to implement the same implicit copy that is currently implemented in our code. The difference would be that the driver author would have knowledge of the use case and its timing requirements, which we don't (and can't) have.

I realize this sucks, and that it has a negative impact. However, that negative impact takes the form of an obvious error. Either I can fix it in my own code, or I at least I know where to open an issue or send a pull request.

With the implicit copy, the fix for the problem would be exactly the same, except it would be non-obvious and require hours, days, maybe even weeks, of debugging. Granted, it would affect a smaller number of people.

You also said the following earlier, which I think I have an answer for now:

I think embedded_hal should be this "just works"-higher level layer.

I don't think it can be. It operates on the level of peripherals and it can't know whether those peripherals are used for something non-critical, or to bit-bang a highly time-sensitive protocol. A higher level layer that "just works" would have to be above that, in a driver API or something equally high-level.

hannobraun avatar Feb 09 '19 11:02 hannobraun

Maybe @jamesmunns @jscarrott & @wez can provide their opinion too. I think #50 depends on this too (does implicit batching).

Yatekii avatar Feb 11 '19 21:02 Yatekii

@Yatekii @hannobraun I haven't had a chance to look at this project for a few months (Hopefully work will quiet down again soon) so I don't feel I can really comment on the specifics of this issue. But I think I can talk generally.

I do think this raises an interesting architecture issue, we have a library that runs on multiple platforms but can't always behave the same way on all platforms. This can't be the only place this sort of issue will arise and I would prefer a consistent approach. It might be simple in this case to add an implicit workaround but it might not be in other cases and so I would like to see all cases made explicit.

As an aside I can't remember being upset at a library for forcing me to deal with an error case, but I definitely remember the times a library has tried to be clever and implicitly 'help' me.

jscarrott avatar Feb 13 '19 13:02 jscarrott

@nrf-rs/nrf52 do we maybe want to bring this up to the embedded wg, or maybe the embedded-hal team? I think this is probably worth discussing for the proposed "patterns book".

I think the approach made by @simonsso is very good if we want to take a "works at all cost" approach (though, I think that is too strong of a wording, you can get optimal behavior if you know to send data out of RAM, though I admit it is also possible to get wrong and have no notification).

It might be good to discuss with the embedded-hal team whether the goal is portability at all costs, or portable with optimal behavior as the primary goal.

jamesmunns avatar Feb 13 '19 13:02 jamesmunns

I found this issue after some confusion when writeln!(uart, "Hello World") worked fine to print over the UART, but uart.write("Hello World"); seemed to spew out garbage (although both functions seemed to execute successfully).

Summary

I think the two main points brought up in this thread are both equally valid (summarized below):

  1. The embedded-hal should provide zero-cost abstractions (or near-zero) to provide functionality.
  2. The embedded-hal should provide a common API that higher-level drivers can use.

To these points, issues have been brought up that seem contradictory to the above points:

  • (1) having an implicit copy from flash to RAM may trigger timing and performance inefficiencies.
  • (2) Not copying buffers provided to the EasyDMA can potentially break higher level drivers, which should not need to know that the EasyDMA cannot transfer from flash, since they are (ideally) platform-independent.

Example

Take a hypothetical example of a sensor that communicates via UART. This driver uses a 4-byte sync word before each command. For efficiency, the device driver crate stores this 4-byte sync word as a constant and sends it before each command to the device. The sync word is stored in flash, but commands are RAM-based.

When an end-user comes and utilizes the provided device crate, it won't work without a copy from flash to RAM, so the user has to clone the device crate and modify it specifically for use with this platform, which breaks the modularity that crates provide.

Proposal

Performing implicit copies without the end-users knowledge should be avoided, since this would not be a zero-cost abstraction. Instead, I propose that we develop a cargo feature within the nrf52-hal that will allow the user to specifically allow implicit copies from flash to RAM if data is provided that exists in flash. If the feature is not enabled, the EasyDMA driver will return an Error whenever data contained in flash is specified. This has the benefits of:

  1. By default, users will be informed that the EasyDMA driver cannot transfer data from flash. No implicit performance hits. The user is informed of the problem and the error can inform them of the optional feature.
  2. When the user enables the feature, they are acknowledging that data transferred via EasyDMA may be copied from flash into a RAM buffer. The copy is now completed with explicit consent from the end-user.

ryan-summers avatar Nov 10 '19 09:11 ryan-summers

Makes sense to me.

jonathanpallant avatar Nov 11 '19 09:11 jonathanpallant