go-play icon indicating copy to clipboard operation
go-play copied to clipboard

[Discussion] Reduce Tearing in NES

Open rickyzhang82 opened this issue 6 years ago • 51 comments

I want to reduce tearing problem in NES. I copied our discussion from odroid forum to here

How it works now

The way the display system currently works:

  1. emulator finishes a frame on core 0
  2. frame is copied and the copy is sent to core 1
  3. emulator works on next frame on core 0

meanwhile on core 1:

  1. a frame is received from core 0
  2. a group of scanlines is processed (palette and scaling).
  3. the group is sent to DMA.
  4. the next group of scanlines is processed while DMA is in progress.

meanwhile on DMA:

  1. data is received by the DMA engine and is fed to the SPI hardware
  2. DMA notifies completion.

There are two frame buffers, two scanline buffers of 3 lines, and multiple DMA buffers.

At 40Mhz SPI, there is 40,000Kbps / 8 bits = 5KBps. A frame is 320 * 240 * 2 bytes / 1024 = 150KB. 5KBps / 150KB = 33.3 fps. However, because there is protocol overhead, the achieved frame rate is less (closer to 30fps). The LCD updates at 70fps (61 is the slowest but looked worse). 70fps / 30fps = 2.3 refreshes for every frame displayed.

LCD controller specs

IL9341 specs

Other's improvement on IL9341 over SPI

fbcp-ili9341 repo

My proposal

  1. Use two frame buffers to do swap instead of memcpy
  2. Find minimal different rectangle between two frame buffer and send the delta over SPI
  3. Use Good old interlacing is added into the mix: if the amount of pixels that needs updating is detected to be too much that the SPI bus cannot handle it, the driver adaptively resorts to doing an interlaced update, uploading even and odd scanlines at subsequent frames. Once the number of pending pixels to write returns to manageable amounts, progressive updating is resumed. This effectively doubles the maximum display update rate.

rickyzhang82 avatar Jun 30 '18 11:06 rickyzhang82

@OtherCrashOverride It seems that you disable double buffering inside nofredno. Why?

https://github.com/rickyzhang82/go-play/blob/master/nesemu-go/components/nofrendo/vid_drv.c#L382-L390

#if 0
   back_buffer = bmp_create(width, height, 0); /* no overdraw */
   if (NULL == back_buffer)
   {
      bmp_destroy(&primary_buffer);
      return -1;
   }
   bmp_clear(back_buffer, GUI_BLACK);
#endif

I will restore it and use swap instead of memcpy.

rickyzhang82 avatar Jul 02 '18 03:07 rickyzhang82

It was likely disabled during early development. I am not currently aware of any issue preventing its use.

OtherCrashOverride avatar Jul 02 '18 08:07 OtherCrashOverride

I'm not saying there is an issue in your modification. I just wonder why you are doing this. I understood that you can NOT dynamic allocate memory in Arduino. Perhaps the same applies to ESP32. Thus, you hack bmp_create. In any case, there is no trace in your change. That sucks.

My goal is to add back a proper double buffering implementation so that I can experiment interlacing and progressive update.

I found the original source nofrendo where double buffering implemented in NES emulator application level. It swap primary buffer and back buffer when flushing

https://github.com/rickyzhang82/nofrendo/blob/master/src/vid_drv.c#L359-L362

   /* Swap pointers to the main/back buffers */
   temp = back_buffer;
   back_buffer = primary_buffer;
   primary_buffer = temp;

I will allocate two buffers in advance for primary and back buffer and swap them in your custom_blit function. More advance experimental stuffs will be done there later.

PS: Don't blame the source mixed with tab and space. The original nofrendo code use 3 spaces as indentation. It is nice and clear. We should fix it in your upstream. It is very irritating to me.

rickyzhang82 avatar Jul 02 '18 09:07 rickyzhang82

I did an experiment by restoring primary buffer and back buffer. Swapping them after custom_blit function.

But I saw more serious screen tearing problem than before.

Video

I still tried to figure it out why. But cloning a copy of frame buffer by memcpy before sending it to video queue seems to solve the problem.

rickyzhang82 avatar Jul 04 '18 04:07 rickyzhang82

@OtherCrashOverride I got stuck in debugging serious tearing problem due to enabling explicit double buffering (see my previous Youtube video)

I made two changes:

  1. Restore primary buffer and back buffer. The primary frame buffer and the back frame buffer swaps to each other after custom_blit is called.
  2. Remove memcpy in custom_blit and directly send frame buffer for SPI data transfer.

Reading through ili9341_write_frame_nes, this function won't return until SPI transaction is completed. This is done by polling spi_device_get_trans_result.

Between xQueueSend from custom_blit function and xQueueReceive from video task, they should synchronize that xQueueSend have to wait until xQueueReceive clear the slot.

This alone enforce that before buffer swapping is done, current frame buffer data (either primary or back) has been sent over SPI. This guarantees that NO overwrite in frame buffer during transferring.

But the test show otherwise. Is it possible that the completion of transferring SPI data doesn't mean that LCD finish display the result?

See snippet of custom_blit and video_task.

static void custom_blit(bitmap_t *bmp, int num_dirties, rect_t *dirty_rects) {
   if (bmp->line[0] != NULL)
   {
      xQueueSend(vidQueue, &(bmp->line[0]), portMAX_DELAY);
      ESP_LOGD(TAG, "xQueueSend bmp(%p) to queue.", (void *)bmp->line[0]);
   }
}

and

static void videoTask(void *arg) {
   uint8_t* bmp = NULL;

   while(1)
   {
      BaseType_t hasReceivedItem = xQueuePeek(vidQueue, &bmp, portMAX_DELAY);
      if (bmp == 1)
         break;
      if (hasReceivedItem)
      {
         ili9341_write_frame_nes(bmp, myPalette);
         ESP_LOGD(TAG, "Sent bmp(%p) over SPI.", (void *)bmp);
         odroid_input_battery_level_read(&battery);
         xQueueReceive(vidQueue, &bmp, portMAX_DELAY);
      }
   }

Here is my experimental branch: https://github.com/rickyzhang82/go-play/tree/exp-enable-double-buffering

rickyzhang82 avatar Jul 05 '18 03:07 rickyzhang82

This guarantees that NO overwrite in frame buffer during transferring.

No such guarantees are made. The buffers are passed by reference (pointer) so it is possible to overwrite data while it is being rasterized.

Is it possible that the completion of transferring SPI data doesn't mean that LCD finish display the result?

The completion is asynchronous. There is no concern for their completion other than that a transaction is available for use. This is in contrast to blocking for completion.

OtherCrashOverride avatar Jul 05 '18 03:07 OtherCrashOverride

@OtherCrashOverride

No such guarantees are made. The buffers are passed by reference (pointer) so it is possible to overwrite data while it is being rasterized.

Let me put it this way. Two different overwrite happens here:

  1. Overwrite primary and back frame buffer while data is transferring through SPI to internal GRAM of LCD controller.
  2. Once transferring is done Overwrite in frame buffer in internal GRAM of LCD controller.

The guarantee refers to point 1. I think this is true as I explain in

Reading through ili9341_write_frame_nes, this function won't return until SPI transaction is completed. This is done by polling spi_device_get_trans_result.

Between xQueueSend from custom_blit function and xQueueReceive from video task, they should synchronize that xQueueSend have to wait until xQueueReceive clear the slot.

My question is that once frame buffer data stays in internal GRAM of LCD controller, are there any guarantee that LCD show the frame before next SPI transaction starts?

rickyzhang82 avatar Jul 05 '18 10:07 rickyzhang82

As mentioned before, the LCD display will update over 2 times (70 fps) https://github.com/OtherCrashOverride/go-play/blob/73337cb7803f38fe15eacbe09a2816ed9f62009c/odroid-go-common/components/odroid/odroid_display.c#L102 before a single complete frame can be sent over SPI (40Mhz). When the LCD refresh happens, what ever has been sent is display whether its complete or not..

OtherCrashOverride avatar Jul 06 '18 00:07 OtherCrashOverride

That makes sense to me. Because frame buffer SPI transaction to internal GRAM of LCD controller is not in sync with LCD refresh rate. SPI transferring speed limit the frame rate to be half of LCD refresh rate. I think I should reduce SPI traffic by using techniques like interlacing and minimal rectangle update.

In any case, I found that if I follow your old way using memcpy to make a copy of primary/back frame buffer into a separate frame buffer in the heap and then do SPI transfer from that separate frame buffer, I don't see any severe tearing like my Youtube video. Note that the separate frame buffer is not defined as DMA_ATTR as primary/back frame buffer does. So I really don't get the secret of your dark magic. Your way should be under the same constrain that you stated in previous thread as well. But somehow yours work.

An off-topic: I saw that you throttle rendering frame rate in nes.c.

        bool renderFrame = ((skipFrame % 2) == 0);

        nes_renderframe(renderFrame);
        system_video(renderFrame);

        if (skipFrame % 7 == 0) ++skipFrame;
        ++skipFrame;

I saw NES emulator 60FPS in output. Does it imply you throttle rendering frame rate in 30 fps or something? I don't quite get if (skipFrame % 7 == 0) ++skipFrame;.

rickyzhang82 avatar Jul 06 '18 01:07 rickyzhang82

DMA is never done from the frame buffer. The data has to be processed by the CPU first.

As stated in the first post, the SPI LCD has a maximum update rate of 30fps. The emulators run at 60fps and discard frames. The "skipFrame % 7" introduces an irregularity to whether odd or even frames are discarded.

OtherCrashOverride avatar Jul 06 '18 03:07 OtherCrashOverride

DMA is never done from the frame buffer. The data has to be processed by the CPU first.

I see your point. It is ili9341_write_frame_nes. I will carefully review it.

rickyzhang82 avatar Jul 06 '18 11:07 rickyzhang82

Thank you @OtherCrashOverride & @rickyzhang82 for keeping this discussion public. I've been reviewing the current state of go and have been testing a few things as there are a ton of ILI9341 libraries, some of which stem from Adafruit's OSS library.

I'm going to carefully watch this thread as the tearing is probably the biggest limitation to this dev kit that people will complain of. Thank you for your hard work =)

jaygarcia avatar Jul 10 '18 00:07 jaygarcia

@OtherCrashOverride I have several interesting findings in my experimental branch:

  1. After reviewing your code in ili9341_write_frame_nes function in odroid_display.c. You are correct that DMA is done in line array there. So I don't have to define DMA_ATTR attribute for primary/back frame buffer. I think I will implement interlacing and delta update there. That's the right spot to try that logic.

  2. I read FreeRTOS doc to learn task and queue. I added ESP_LOGD to confirm that my implementation of double buffering doesn't introduce race condition of read/write in double frame buffer between xQueueSend and xQueueRecieve where swapping frame buffer and DMA from line array to LCD controller are done. In fact, I suspect that the serious flickering in my previous Youtube video is caused by the speed differences between LCD refresh speed and SPI transferring speed. I can see that printing out from ESP_LOGD slow down frame speed from 60 fps to 47 fps. The serious flickering has gone. See Video

rickyzhang82 avatar Jul 10 '18 01:07 rickyzhang82

@jaygarcia It is not my ideas at all. I got two ideas interlacing and delta update from juj's repository. His ili9341 work build on RasPi. But the improvement algorithm may apply to ESP32.

rickyzhang82 avatar Jul 10 '18 01:07 rickyzhang82

hi @rickyzhang82 i was also looking at the same interlacing example and was wondering if the overhead of transactions was too high for interlacing to work well, as each row "paint" would require a whole new transaction, resetting the starting address w/ the LCD and it's required overhead.

In regards to your point #2 above, are you swapping memory pointers at the end of of the SPI transfers? E.g. After send_continue_wait(); call?

jaygarcia avatar Jul 10 '18 19:07 jaygarcia

  1. Current implementation ili9341_write_frame_nes sends 5 lines in every SPI transaction using memory continue write command 0x3c, while the proposed interlacing or delta update (minimal rectangle update) will use at least 3 additional SPI transaction 0x2a (column address), 0x2b (page address) and 0x2c (memory write) for each line/rectangle update. For interlacing, if the overhead exits, I'd estimate that at least 2.5 times more SPI transaction in writing memory than the original approach. I don't know the overhead until I get it implemented.

  2. I swapped the buffer after xQueueSend. By the time there is one and only one empty slot available in the vidQueue, it implies that xQueueReceive has been done so that the slot has been freed. Because ili9341_write_frame_nes finished before xQueueReceive. send_continue_waite must been done within ili9341_write_frame_nes. So the answer is definitely YES. See code

rickyzhang82 avatar Jul 11 '18 02:07 rickyzhang82

Thanks @rickyzhang82 =). I've been trying to to follow your fork in addition to learning the codebase and your explanation has helped me =)

jaygarcia avatar Jul 11 '18 13:07 jaygarcia

Sorry I'm trying to understand, does this mean the tearing issue might be fixed in the future?

JasonB32 avatar Jul 20 '18 21:07 JasonB32

@OtherCrashOverride I encountered heap corruption when using spi_device_get_trans_result to synchronize dual line buffer in DMA over SPI bus.

I suspected that loading states from uSD in core 0 over SPI causing the problem Because the heap corruption happened around the time it tries to load saved state from uSD card.

I saw the original send_continue_line and send_continue_wait in odroid_display.c using spi_device_get_trans_result to sync after the whole screen data transfer is done. But spi_device_get_trans_result only calls one time only. I don't have a doc to explain if transaction number should match between spi_device_queue_trans and spi_device_get_trans_result . But the sample code from ESP-IDF does match.

rickyzhang82 avatar Jul 21 '18 02:07 rickyzhang82

@JasonB32 It is pure experimental work. I'm working on interlacing now.

rickyzhang82 avatar Jul 21 '18 02:07 rickyzhang82

Ohh you're a very good man @rickyzhang82 Thank you!!!

JasonB32 avatar Jul 21 '18 03:07 JasonB32

@rickyzhang82 The LCD SPI support has been reworked in the smsplusgx branch.

OtherCrashOverride avatar Jul 21 '18 19:07 OtherCrashOverride

@OtherCrashOverride did you notice a significant improvement in migrating from task queues to semaphores? =)

jaygarcia avatar Jul 21 '18 19:07 jaygarcia

@OtherCrashOverride

I noticed you convert task queue into semaphore.

Using bitmask 0x80 on t->user to release semaphore on line buffer looks like a good trick.

The original approach relies on spi_device_get_trans_result doesn't seem to work at all. Once the number of spi_device_queue_trans goes up, it is hard to match spi_device_get_trans_result to synchronize correctly.

PS: I commented out loading uSD card loading code. The heap corruption error is not related to shared SPI bus between LCD and uSD card. I will look into your approach and see if I can borrow your idea to make sure spi_device_queue_trans and spi_device_get_trans_result matches.

rickyzhang82 avatar Jul 21 '18 22:07 rickyzhang82

@jaygarcia I tried on smsplusgx branch. I didn't see obvious improvement in terms of tearing.

rickyzhang82 avatar Jul 21 '18 23:07 rickyzhang82

I figured out the heap corruption problem. It is because of mismatch between spi_device_queue_trans and spi_device_get_trans_result. Every one spi_device_queue_trans function call should have one and only one correspondingspi_device_get_trans_result.

In any case, I implemented interlacing on my exp-interlacing branch.

The result doesn't look great as I thought. This reminds me of the old days when my using Sony DV cassette tape, importing video by firewire and converting them by ffmpeg with interlacing enabled.

See demo video. But this is a good start for the next progressive update with minimal delta.

@jaygarcia, I don't see SPI transaction overhead is an issue so far. I can see that it can stay around 60 fps in interlacing.

@OtherCrashOverride I don't migrate my interlacing experiment to your new branch as I haven't seen the real benefit yet. I may revisit it when I finish progressive update next.

rickyzhang82 avatar Jul 22 '18 03:07 rickyzhang82

@jaygarcia The semaphore implementation is both more reliable and also more efficient (faster).

OtherCrashOverride avatar Jul 22 '18 03:07 OtherCrashOverride

@OtherCrashOverride

The previous task queue design did have reliability issue but this is due to the fact that mismatch between spi_device_queue_trans and spi_device_get_trans_result.

I'm more curious on the efficiency in your new implementation. Have you done any empirical performance test in semaphore vs task queue?

rickyzhang82 avatar Jul 22 '18 13:07 rickyzhang82

Empirical evidence was collected by running emulation tests with audio disabled and observing the frame rate. Evidence can also be observed in the slope change of the tearing patterns present.

OtherCrashOverride avatar Jul 22 '18 18:07 OtherCrashOverride

@OtherCrashOverride I skimmed through the FreeRTOS implementation in semaphore and task queue. It looks like they are both calling the same queue API, which it implies two implementation are the same things. Thus, I doubt there is real benefit in terms of performance between semaphore and task queue. For record, my conclusion is based on 'skimming through' source code. It might be wrong.

Could you please tell me how you can do a benchmark test like the slope change and etc?

I did find that it computes FPS. But I want to know how can it only measure CPU 0 can tell CPU1 performance. Also, I wonder when spi_device_get_trans_result blocking call is done (assuming max delay), does it imply DMA is done.

thanks

rickyzhang82 avatar Jul 25 '18 13:07 rickyzhang82