
Add a filesystem API

Open • davidje13 opened this issue 4 years ago • 11 comments

Now that I have a working copy of this library (#171), I'm impressed. I'd like to use it in one of my projects, but there are a few missing things.

I'd be interested in implementing these myself and contributing back, but I want to discuss them first to make sure they're in keeping with the philosophy of this project (and to catch any known gotchas early).

Provide a filesystem API

Right now the kernel module registers init and exit methods, but exposes no device API on the filesystem. I could envision a number of useful options which could be exposed this way (a rough userspace-side sketch follows the list):

  • trigger vsync NOW (compatible applications could bypass the famous https://github.com/raspberrypi/userland/issues/440 by explicitly triggering a vsync when they are ready for it)
    • clearly this would only work if there is only one application rendering at a time, but I think that's OK
    • the library could still include a slow frame timer so that if it doesn't get a vsync event after (say) 0.1 seconds, it triggers anyway and goes back to its regular polling mode (or whatever it was compiled with). That would keep it usable with applications which don't call the vsync API. When the vsync API is called again, it switches back to waiting for the next vsync.
    • this would be an optional compilation flag
  • reconfigure currently static config, such as:
    • target framerate (no need for this to be hard-coded; it could be changed dynamically by applications or the user)
    • change stats display (compilation flag would still decide what's possible, but now the user would also have the choice to turn it off at runtime if they want)
  • query stats programmatically (e.g. if only achieving ~30fps, an application could throttle itself and explicitly target 30fps for a smoother experience)
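
To make this concrete, here is a minimal userspace-side sketch of what driving such an interface could look like. The device path and the text commands are assumptions invented purely for illustration; nothing like this exists in fbcp-ili9341 today.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical control node; no such device exists in fbcp-ili9341 today. */
    int fd = open("/dev/fbcp-ili9341", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* "Push the current frame to the display now" (the vsync-style trigger). */
    if (write(fd, "vsync\n", 6) < 0) perror("write vsync");

    /* Reconfigure a currently-static setting at runtime. */
    if (write(fd, "target_fps 30\n", 14) < 0) perror("write target_fps");

    close(fd);
    return 0;
}
```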

Maybe there are better ways in linux to allow user-space applications to communicate with a kernel module?

Allow rendering without DispmanX

A filesystem write command could allow applications to send data directly to the kernel module for rendering. This would be better than having applications include (stripped-down) code from this repository, which is the current recommendation in the readme. It may also open the door to exposing some more high-level capabilities, such as explicitly updating a particular region, or using hardware scrolling, etc. which would be a huge benefit for some applications.
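
As a rough sketch of the write-command idea (the device node and the one-raw-RGB565-frame-per-write convention are assumptions for illustration, not an existing interface):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define WIDTH  320
#define HEIGHT 240

int main(void)
{
    static uint16_t frame[WIDTH * HEIGHT];            /* RGB565 pixels */

    /* Hypothetical framebuffer node exposed by the kernel module. */
    int fd = open("/dev/fbcp-ili9341-fb", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* ... the application renders into frame[] here ... */

    /* One write() per frame; the module would diff it and stream the
     * changed parts over SPI. */
    if (write(fd, frame, sizeof frame) != (ssize_t)sizeof frame)
        perror("write frame");

    close(fd);
    return 0;
}
```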

This would either be a separate compilation option (which removes frames and DispmanX entirely from the build) or would be a switchable runtime mode (switch to API rendering, switch back to DispmanX rendering). I could envision auto-switching like with the vsync suggestion above, but that probably wouldn't be great if the app runs a little slow and the user sees flashes of DispmanX content as it switches back and forth.

Another option would be to continue using DispmanX but at least allow specifying alternative framebuffers or surfaces, so that it doesn't have to mirror the HDMI screen.


Sorry for writing such a large issue, but it's all connected and I thought it would be better than spamming the issue tracker! I don't intend to implement everything here, but I'd like to get things going with a few of these. Thoughts from the maintainer?

davidje13 · Oct 09 '20 00:10

I started poking around and realised that the kernel sources aren't used by the default build, and seem to be at an early stage judging by their short git history. That prevents me from using root_device_register & co.

I'm wondering if the better option is:

  • try to find a way to add communication to the userspace program (maybe with a socket bound to localhost, which is probably the most common approach for many-to-one communication with a userspace program in Linux, though it raises questions about protocols; a minimal sketch of this option follows the list);
  • or get the kernel stuff working and continue with the original plan.
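
For the first option, here is a minimal sketch of what the listening side in the userspace program might look like. The port number and the single-line command "protocol" are made up purely for illustration:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9341);                    /* hypothetical port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* localhost only */

    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) { perror("bind"); return 1; }
    listen(srv, 1);

    int client = accept(srv, NULL, NULL);
    char cmd[64];
    ssize_t n = read(client, cmd, sizeof cmd - 1);
    if (n > 0) {
        cmd[n] = '\0';
        printf("received command: %s", cmd);  /* e.g. "vsync\n" -> flush a frame now */
    }
    close(client);
    close(srv);
    return 0;
}
```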

How complete is the kernel module? I doubt I'd be able to help much there if it still needs a lot of work.

davidje13 · Oct 09 '20 16:10

While there does exist a kernel module in this tree, it is not actually currently used for anything (the code has been left there in case it proves useful later - I believe kpishere's fork uses it though).

I did for a good while try to prove that a kernel-space driver would be better than, or at least as good as, a userland driver, but eventually had to give up. The benefit of a kernel driver would be that it could directly allocate and map non-virtual, DMA-peripheral-visible memory, which would avoid one memcpy by allowing the framebuffer diffing code to write directly into DMA-visible task memory. Also, from the kernel side it would be possible to cleanly allocate DMA channels to avoid any chance of conflict.

However, the drawback was that even with the framebuffers memory-mapped over to userland, the latency of signaling from userland over to the kernel side was far too high. Basically, a message from one process to another seems to have at least 2-3 msecs of delay, which is too much. On the Pi Zero, one has to send the message, then yield the process, then wait for the other process to take over, which is easily a 10 msec+ round trip at best.

Hence in the end, the fastest version of the driver ended up operating directly in userland.

A filesystem device API would perhaps allow some flexibility for driving the display from other applications, and would avoid DispmanX, but I have serious doubts that the performance would be good.

When you write that a program would be able to trigger a vsync, you probably mean triggering a frame flip/update to the display (SPI displays drive their vsync signals autonomously based on their own clock circuitry, and that vsync signal is unfortunately not even observable outside the SPI display).

In the documentation I mentioned this item:

  • Port fbcp-ili9341 to work as a static code library that one can link to another application for CPU-based drawing directly to display, bypassing inefficiencies and latency of the general purpose Linux DispmanX/graphics stack.

I believe that would be the best way to get the best performance from the driver if the goal is to skip DispmanX. For applications that do not do GPU-based rendering, this would be able to give < 1 msec processing delays on display updates. An application would then link to a static library version of fbcp-ili9341 to get direct access to the SPI display.

Maybe there could be some kind of device-file-based enable/disable control for fbcp-ili9341 operation, which would allow one to have a global system fbcp-ili9341 running that is based on DispmanX, and then an app-specific library could disable the system one when it wants to take over. Flipping such an enable/disable bit would not need to be low latency, so it wouldn't have that issue.
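
A sketch of what flipping that bit from an app-specific program could look like, assuming (purely hypothetically) a sysfs-style attribute file; the path and mechanism are inventions for illustration:

```c
#include <stdio.h>

/* Hypothetical control file; the real mechanism and path are undecided. */
#define FBCP_ENABLE_PATH "/sys/class/fbcp-ili9341/enable"

static void set_system_fbcp_enabled(int enabled)
{
    FILE *f = fopen(FBCP_ENABLE_PATH, "w");
    if (!f) { perror("fopen"); return; }
    fprintf(f, "%d\n", enabled);
    fclose(f);
}

/* An app wanting direct control of the display would call
 * set_system_fbcp_enabled(0) on startup and set_system_fbcp_enabled(1)
 * again on exit. */
```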

The kernel module is out of date, but probably not that horribly obsolete. I think it is likely not going to be performant enough though, so not worth the effort.

Something I have been thinking about is creating a kernel driver stub just to allocate DMA channels (only the kernel can do that), to solve the DMA channel conflict issue. Though given that one tends to control all the software running on the Pi, manually assigning channels has not been a big hurdle (at least for me), so I haven't bothered.

Overall I think DispmanX is the current biggest/weakest link. I wish there existed a better display API on the Pi to avoid those issues.

juj · Oct 09 '20 17:10

hmm, I see the problems.

First, yes, when I say "vsync" I intended "framebuffer swap" or more specifically: "send the current data to the device now please". Is there a good name for it when it's a push operation rather than a pull?

I'm surprised you say it's 10ms+ of latency to communicate between the processes. Frankly that's huge and makes me wonder how the Pi would be able to get anything at all done, since linux is full of processes talking to each other all the time. I was somewhat hoping that it would be possible to have something like this:

  • Application renders frame to GPU
  • Application sends "vsync" signal
  • fbcp picks up vsync signal and snapshots current display data
  • fbcp performs diffing and starts sending data
  • Control returns to the application
  • Application renders the next frame (while SPI transfer is ongoing in the background, since it's mostly waiting from the CPU's perspective)

But if it were all handled by a single process I assume it would have to wait for all the display data to be sent before it can continue with the application logic, which would be a huge bottleneck. Maybe I've misunderstood something about how that can work?


If this were to take the single-process (library) approach, I'd suggest a dynamic library. That way an application can be compiled without knowing which exact device it will target (i.e. no need to specify pinouts when compiling the app), and the library can be compiled with that info as a .so. That means there would need to be a fixed API/ABI for the core functionality (diffing, pushing to the device) regardless of compiler flags.

Is there a particular reason you prefer a static library? (AFAIK dynamic only incurs a small overhead at startup, but no additional overhead once the application is running, except the inability to inline small function calls which shouldn't matter here).
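
For illustration, a fixed API/ABI for the core functionality could look something like the header below. Every name and signature here is hypothetical rather than the project's actual API:

```c
#ifndef FBCP_ILI9341_H
#define FBCP_ILI9341_H

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Initialise SPI/DMA and the display that the library was built for. */
int fbcp_init(void);

/* Diff the given RGB565 frame against the previous one and push the
 * changed spans to the display. Returns 0 on success. */
int fbcp_submit_frame(const uint16_t *pixels, int width, int height);

/* Release DMA channels and shut the display down. */
void fbcp_shutdown(void);

#ifdef __cplusplus
}
#endif

#endif /* FBCP_ILI9341_H */
```

The same header would serve whether the implementation behind it is built as a .so or a .a; only the link step differs.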

Once a library (of either kind) is extracted, the fbcp-ili9341 program would just push frames from DispmanX to the library (being responsible for pulling frames, framerate guessing, etc.). I'm not sure where the stats, battery indicator or keyboard checking would live (maybe stats would live in the library, and battery indicator would be split so that display lives in the library while pin polling lives in the fbcp-ili9341 program? with a similar split for keyboard/backlight too).

It leaves the problem you mention about how to switch between applications (including back to DispmanX rendering). But if it's not a kernel module I don't have any suggestions there.


Also if I pick up the library approach, I'd probably start by just scrapping the existing kernel work. Is that OK?

davidje13 · Oct 09 '20 18:10

I was somewhat hoping that it would be possible to have something like this:

* Application renders frame to GPU

* Application sends "vsync" signal

This is the first spot where things would go wrong. An application does not render a frame on the GPU and then, once it's done, send a vsync signal. The issue here is that applications submit GL commands to be executed asynchronously by the GPU. While the GPU is working on them, the application already proceeds to process and render the next game/application frame. I.e. the app cannot afford to wait until a frame is done to send a vsync signal; that would cause a CPU-GPU pipeline starvation. The GPU posts a signal when a frame has finished rendering. This signaling currently occurs via DispmanX.

If an application did this, it would have to forgo any asynchronous GL/GPU rendering, so that it immediately/synchronously knows that it has plotted all pixels on the screen.

I'm surprised you say it's 10ms+ of latency to communicate between the processes. Frankly that's huge and makes me wonder how the Pi would be able to get anything at all done

Don't hold me to that exact figure, though the Pi Zero is somewhat of a stuttering and slow mess. The Pi 3 does not naturally have as much of an issue, because it actually has several hardware cores.

But if it were all handled by a single process I assume it would have to wait for all the display data to be sent before it can continue with the application logic, which would be a huge bottleneck.

The key here is to use DMA. fbcp-ili9341 does not need to pause and wait for all display data to be sent; it can issue a full frame's worth of display updates to the SPI bus in one go. Currently there is a limitation, though: because atomic DMA chaining is not implemented, a second frame cannot be queued while a first frame is still in flight. (I am not sure if the BCM2835 hardware can support it; the register fields imply that the philosophy would be to allow atomic DMA chaining, but there is no mention in the official docs and I have not seen anyone do it anywhere.) That would cause some pauses in a single-process program that the current dedicated-process driver does not have.

Apart from that, the above logic is already what fbcp-ili9341 does, except that instead of 'Application sends "vsync" signal', it is DispmanX that sends the vsync signal.

The static vs dynamic library question is probably not a big issue, it could be compiled in either mode.

Once a library (of either kind) is extracted, the fbcp-ili9341 program would just push frames from DispmanX to the library

The core idea for linking as a library was the intent to directly hook to an application that renders in software (like e.g. a NES game emulator or other software renderer) without needing to round trip the frames via a GPU. If a linked library approach would still use DispmanX, I am not sure what the benefit of the whole exercise would be - the current driver program today would be superior to that.

But if a program was not really using DispmanX at all in the first place, then a linked library approach would allow skipping the app->DispmanX->fbcp-ili9341 round trip, and allow directly going app->fbcp-ili9341.

However linking as a library for a software renderer would have the drawback of losing GPU screen scaling for e.g. simultaneous 1080p HDMI output while mirroring the output to a 320x240 display, if one wanted that. DispmanX can do that automatically. That is one limitation to keep in mind, which would limit the usefulness of the approach.

Also if I pick up the library approach, I'd probably start by just scrapping the existing kernel work. Is that OK?

I think the core question to look back at is what problem are you looking to/want to solve?

juj · Oct 09 '20 19:10

The core idea for linking as a library was the intent to directly hook to an application that renders in software (like e.g. a NES game emulator or other software renderer) without needing to round trip the frames via a GPU. If a linked library approach would still use DispmanX, I am not sure what the benefit of the whole exercise would be - the current driver program today would be superior to that.

Sorry, I wasn't clear. I agree; the idea is for applications to link directly. I just meant that, in order to maintain the existing functionality (as another way of running it), that's how I'm envisioning the standalone fbcp-ili9341 program working. It would mean that anybody who still wants to mirror HDMI could run it in that mode, with presumably very little extra overhead over the current implementation. Both would still ultimately just call the common library.


I think the core question to look back at is what problem are you looking to/want to solve?

The main limitation I'm looking to get around is being able to show one thing on the LCD and another on the HDMI screen. As I mentioned, the readme currently recommends including this code in the application to do that, which isn't something I want to do (not least because it makes updating it difficult). So my first idea was to make the runtime more controllable with the filesystem approach, but now I'm thinking that pulling out a library will work for my use case. The main thing I want to achieve, though, is to have whatever I do make it back to the upstream repository (this one) so that I can keep using future updates without lots of hassle. Hence the thinking around how these changes can still maintain the current behaviour.

davidje13 · Oct 09 '20 19:10

The main limitation I'm looking to get around is being able to show one thing on the LCD and another on the HDMI screen.

Ok, now I understand.

Thinking about this, if the goal is not to get rid of DispmanX, I would say the simplest method might be to expose a shared memory block between the processes (https://stackoverflow.com/questions/5656530/how-to-use-shared-memory-with-linux-in-c/5656561): fbcp-ili9341 would expose a shmem buffer that a user process can open to obtain a pixel framebuffer, and then the user process would signal that a frame is done via a mutex, much like you mention.
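
A minimal sketch of that idea, assuming a 320x240 RGB565 buffer and using a POSIX semaphore (rather than a bare mutex) as the frame-done signal. The shmem name and layout are invented for illustration and are not part of fbcp-ili9341; link with -lrt -lpthread:

```c
#include <fcntl.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WIDTH  320
#define HEIGHT 240

struct shared_fb {
    sem_t frame_ready;                 /* posted by the app, waited on by fbcp-ili9341 */
    uint16_t pixels[WIDTH * HEIGHT];   /* RGB565 framebuffer */
};

int main(void)
{
    /* fbcp-ili9341 would create this block; a user process would shm_open()
     * the same name and mmap() it to get at the framebuffer. */
    int fd = shm_open("/fbcp-ili9341", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(struct shared_fb)) < 0) { perror("ftruncate"); return 1; }

    struct shared_fb *fb = mmap(NULL, sizeof *fb, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) { perror("mmap"); return 1; }
    sem_init(&fb->frame_ready, /*pshared=*/1, 0);

    /* App side: render into fb->pixels, then signal that the frame is done. */
    memset(fb->pixels, 0, sizeof fb->pixels);
    sem_post(&fb->frame_ready);

    /* fbcp-ili9341 side: sem_wait(&fb->frame_ready), then diff + DMA the pixels. */
    munmap(fb, sizeof *fb);
    close(fd);
    return 0;
}
```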

This would still be limited to software rendering for the SPI display; there is no way that I know of that would allow the GPU to render different content to two displays at once.

juj · Oct 10 '20 06:10

An update on this:

I'm working on this in https://github.com/davidje13/fbcp-ili9341/tree/filesystem

My current target is to split a library out of the standalone application (i.e. #90), and I have been gradually teasing apart the modules in order to achieve this. Unfortunately that means that quite a lot has changed already. It would be good if you could review some of the commits so far to make sure it would be acceptable for re-merging. (It's a work in progress, obviously, but it still compiles and runs fine, at least with the config I'm using. In theory no functionality has been removed, but given the large number of compilation options I can't guarantee I haven't broken something, e.g. by removing a transitive include.)

I'm having some trouble with the statistics and gpu modules, which seem to be quite pervasive (in particular the links between gpu and dma are pretty tight, and statistics has references in both directions which makes things tricky).

Some accomplishments so far:

  • diff has an abstracted (single-function) API for diffing, manages its own initialisation, and is independent of the source of data (no dependency on gpu). Because of the way I initialised it, it might reserve slightly more memory than before for the diff rects: it now reserves space for the entire SPI screen, rather than skipping any aspect-ratio-induced black bars as before, so that it doesn't need to worry about dynamic changes to the aspect ratio.

  • low_battery has been split into the pin polling part and the display part. The polling lives in the standalone app, and the display lives in the library.

  • all display-specific files live in library/displays now, which makes them easier to find, and easier to ignore when not working on them!

  • gpio control has been broken out into a separate header, so that files do not need to bring in the whole spi interface just to read pins (e.g. battery monitoring). This is similar to the existing abstraction in tick.h. I think these GPIO functions could be exported as part of the official library API.

  • throttle_usleep is an alias for usleep unless NO_THROTTLING is used (a rough sketch follows this list). This is functionally the same as before, but makes it much clearer which usleep calls can be omitted and which cannot (e.g. protocol-required sleeps). Previously it was only possible to tell which usleep was being used by looking at the imported headers, which also made it quite fragile.
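
A rough sketch of that alias, following the description in the item above (the real commit may differ in details):

```c
#include <unistd.h>

/* Pacing sleeps use throttle_usleep and compile away under NO_THROTTLING;
 * protocol-required delays keep calling usleep() directly. */
#ifdef NO_THROTTLING
#define throttle_usleep(usecs) ((void)0)
#else
#define throttle_usleep(usecs) usleep(usecs)
#endif
```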

The bulk of the remaining work is to divide out chunks of fbcp-ili9341.cpp into library API functions, and figure out what to do with the gpu and statistics references.

davidje13 · Oct 22 '20 20:10

Thanks for the update. There are so many changes that unfortunately reviewing this may become a challenge. At a glance the changes seem solid, but I already fear that I will have a hard time doing a decent detailed review, since there is so much that changes.

I wonder if it would be possible to separate the pull request into individual logical parts that would help out the review process? That way we could take a look at every conceptual change on its own. If it all becomes one large code drop, I worry that I will never find the time to go through it all.

juj · Oct 23 '20 08:10

Yeah I know what you mean. I've been on the receiving end of things like this before.

Each change is fairly isolated within a commit already, and the project builds and runs fine after every commit, so it should be quite straightforward to do this on a commit-by-commit basis. Once I'm done and things have settled down, I'll try rewriting the commits to make the history as simple as possible (e.g. I've moved files around a couple of times; I'll change that to a single reorganisation commit). I can make separate PRs for some of the smaller things (e.g. throttle_usleep, diff isolation), but I think it will still be a fairly large PR for the library extraction itself.

davidje13 · Oct 23 '20 09:10

hey @davidje13! I've recently been working on a similar issue: I want to show a tty with minimal overhead on a small TFT display. I figured using fbcp with the LXDE terminal would be my best bet! =D Can you help me understand how to use the fork to show an application on the small TFT? I can try to modify the LXDE terminal app's source to work with this library. :D

tm9k1 · Jan 22 '21 15:01

@tm9k1 sadly as is often the way, other projects came along and I didn't finish this work. I can't remember exactly how far I got, but my initial target was to break the code up.

You should find that library/ contains the important code for handling frames and communicating with the device. It isn't actually built as a library yet (I can't remember if I finished removing the dependencies on the non-library code or not; I know I had a problem with the statistics handling and I think there might still be some framebuffer dependencies).

The intent was that the fbcp_* functions would be the public API (see that header file for some documentation, but notice that there is no function yet for actually providing the frame data or a diff).

standalone/ contains the remainder of the code and can be used as a reference to see how to call the fbcp_* functions (see the main loop). This includes lots of stuff you wouldn't need if calling the functions directly (framebuffer polling, fps prediction, battery/keyboard polling, etc.)

You can pretty much see the point I reached for extracting code here: https://github.com/davidje13/fbcp-ili9341/blob/filesystem/standalone/fbcp-ili9341.cpp#L329 - after that, there is a lot of diff-handling code and pixel-pushing. Eventually the intent was to extract that into library functions somehow, but it's not a trivial extraction!

So for now, if you want to use this work, you'll need to compile the library sources into your own project, and figure out the bits I didn't get around to. I hope to revisit this eventually but that might be some time away. If you make progress it would be great if you can share it!

davidje13 · Jan 22 '21 20:01