
A few questions about the project

Open mvelbaum opened this issue 6 months ago • 3 comments

Hi @sz3,

Since there's no contact information anywhere, I decided to ask a few questions here :)

  1. Why does the QR code shake during encoding?
  2. What is the flow of end-to-end testing? Do you have to use the phone's camera when testing, or do you also have a faster way to test encoding & decoding without involving the phone initially (for a fast edit-test-debug cycle), and only test with the camera later?
  3. What kind of background does one need to work on a project like this? It seems to require at least some background in image/video processing, computer graphics, and information theory (for the ECC and maybe the compression stuff). Did you have this background already, or did you learn it just for this project? :) I'm only asking because I'd like to know what I should focus on to better understand this stuff :)

Thanks a lot!

mvelbaum avatar Jun 04 '25 19:06 mvelbaum

Hi! I should probably put my socials on my gh profile at least...

  1. Why does the QR code shake during encoding?

Some "in between" frames are blurry (actually a combination of 2 different frames), and will never decode -- garbage in, garbage out. Moving the code a bit makes the corner anchors overlap, and this lets us detect/fail the decode during the "scan" step rather than later.

Failing sooner is good because heat is our enemy and extra computation is extra heat. To put some concrete numbers on it: say it takes ~45ms to run the scan, another ~90ms to run the extract (where we apply a matrix transform to the input image to make its size match what we expect), and another ~90ms to run the decode (lots of cache misses and popcnts). Even if those extra 180ms were only an extra 45ms (e.g. when running on newer hardware than the old phone these numbers are from), it will always help to fail fast and move on to the next thing.
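
To sketch that control flow (toy python, not libcimbar's actual API -- the stage names and rough timings are the ones above; everything else is invented):

```python
def scan(frame):                    # ~45ms: find the corner anchors
    return frame.get("anchors")     # None when the anchors smear/overlap

def extract(frame, anchors):        # ~90ms: matrix transform to expected size
    return frame["image"]

def decode(image):                  # ~90ms: lots of cache misses and popcnts
    return b"payload"

def process_frame(frame):
    anchors = scan(frame)
    if anchors is None:
        return None                 # fail fast: skip ~180ms of extract+decode
    return decode(extract(frame, anchors))

# a smeared "in between" frame bails after the cheap scan step:
print(process_frame({"image": "...", "anchors": None}))
```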

  2. What is the flow of end-to-end testing? Do you have to use the phone's camera when testing, or do you also have a faster way to test encoding & decoding without involving the phone initially (for a fast edit-test-debug cycle), and only test with the camera later?

It depends where we draw the box around "end-to-end". Does that mean the image->text->image component? Maybe it includes the error correction as well? Perhaps we also want to layer file compression? And away we go... 🙂

The way I think about it is there are two "end-to-end"s here -- one for the format, and one for the android app. When there are other decoder apps (e.g. when I get the webapp decoder working), those will have their own end-to-end test procedure as well.

The format can be tested without any cameras in the loop -- running ./cimbar --encode ..., then ./cimbar to decode, is sufficient. There are automated tests that do exactly this, plus regression tests to make sure the output hasn't changed. Besides that, the python implementation should match at every relevant layer (without ecc, with ecc, without compression, etc), which is another useful thing to have.
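
A no-camera round trip along those lines might look like this in python (the flags are placeholders, not the real interface -- check ./cimbar --help; only the encode-then-decode shape is the point):

```python
import filecmp
import subprocess
import tempfile
from pathlib import Path

def round_trip(cimbar: str, payload: Path) -> bool:
    """Encode a file to barcode images, decode them back, compare bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # hypothetical flags -- substitute the tool's real arguments
        subprocess.run([cimbar, "--encode", "-i", str(payload),
                        "-o", str(out / "frame")], check=True)
        frames = sorted(str(p) for p in out.glob("frame*"))
        decoded = out / "decoded.bin"
        subprocess.run([cimbar, *frames, "-o", str(decoded)], check=True)
        return filecmp.cmp(str(payload), str(decoded), shallow=False)
```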

The android app testing is manual. Physical constraints play such a large role -- thermal behavior is the biggest one, but cameras also sometimes react in unexpected ways -- and because of those two factors (heat, camera), testing has to include real hardware. Mainly what I try to do is avoid performance regressions, controlling for current conditions as well as I can (measure how the old code does, measure the new code, measure the old code again, measure the new code again).
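
That alternating measurement pattern, as a toy harness (hypothetical -- the real measurements happen on a phone under thermal load, not in a python loop):

```python
import statistics
import time

def interleaved_benchmark(old, new, rounds=5, reps=50):
    """Interleave old/new runs so drifting conditions (thermal throttling,
    background load) hit both versions roughly equally."""
    old_ms, new_ms = [], []
    for _ in range(rounds):
        for fn, bucket in ((old, old_ms), (new, new_ms)):
            start = time.perf_counter()
            for _ in range(reps):
                fn()
            bucket.append((time.perf_counter() - start) * 1000 / reps)
    return statistics.median(old_ms), statistics.median(new_ms)
```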

  3. What kind of background does one need to work on a project like this? It seems to require at least some background in image/video processing, computer graphics, and information theory (for the ECC and maybe the compression stuff). Did you have this background already, or did you learn it just for this project? :) I'm only asking because I'd like to know what I should focus on to better understand this stuff :)

Work on? Or create? Because I think those are slightly different answers.

The reason I started this project (besides that no one seemed to be looking at this problem space, and that void was compelling) was that it touches on many different areas of the field, and bumps into non-negotiable real-world constraints. I didn't have particular expertise in a lot of this stuff, and maybe still don't -- but I have enough to make things work... I did already have a background as a generalist software engineer (vs a specialist) who's seen a lot of different things and worked with a lot of different technologies.

Probably the most important skill is working on performance-sensitive code against real hardware. So much of modern computing addresses resource problems by "pick a bigger EC2 instance", and when that's the answer at your day job it's hard to learn to tackle real optimization problems. Meanwhile, my test Snapdragon isn't getting any faster -- if anything, it's getting slower as the hardware degrades. 🙂

Besides the performance/optimization aspect, there's:

  • opencv/camera image processing (2D image transformations -- see the sketch after this list)
  • image hashing (perceptual hashing)
  • error correction
  • file compression
  • webassembly
  • multithreading/parallelism (we could lump this in with performance, but I think it's worth splitting it out)
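
Here's a toy version of the transform mentioned in the first bullet (and in the "extract" step earlier) -- not libcimbar's actual code, and every coordinate is made up:

```python
import cv2
import numpy as np

# Map four detected corner points onto the square the decoder expects.
# The frame, corners, and output size below are all invented numbers.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in camera frame
corners = np.float32([[402, 87], [1480, 95], [1490, 1010], [395, 1003]])
size = 1024                                        # assumed output size
target = np.float32([[0, 0], [size, 0], [size, size], [0, size]])

matrix = cv2.getPerspectiveTransform(corners, target)
extracted = cv2.warpPerspective(frame, matrix, (size, size))
print(extracted.shape)                             # (1024, 1024, 3)
```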

The list isn't so intimidating if you understand that there's no need to be (or become) an expert in each field. I don't really understand compression any better than I did going in -- I just know how to use the zstd library. And fountain codes (and to a lesser extent all error correction) still seem like mysterious dark magic. But I do know what these things are useful for and how to use them.
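
And "knowing how to use the zstd library" really is about this much (python's zstandard binding here; the compression level is arbitrary):

```python
import zstandard  # pip install zstandard

data = b"hello cimbar " * 1000
compressed = zstandard.ZstdCompressor(level=6).compress(data)
restored = zstandard.ZstdDecompressor().decompress(compressed)
assert restored == data
print(f"{len(data)} -> {len(compressed)} bytes")
```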

sz3 avatar Jun 06 '25 05:06 sz3

Thanks for such a detailed response, you rock! :)

Some "in between" frames are blurry (actually a combination of 2 different frames), and will never decode -- garbage in, garbage out. Moving the code a bit makes the corner anchors overlap, and this lets us detect/fail the decode during the "scan" step rather than later.

What do you mean by "in between" frames? Just frames that happened to be blurry between two frames that decode properly, or some video codec thing that does frame interpolation? I thought intuitively that shaking the QR code would make the frames blurry, but I guess I am wrong.

The way I think about it is there are two "end-to-end"s here -- one for the format, and one for the android app. When there are other decoder apps (e.g. when I get the webapp decoder working), those will have their own end-to-end test procedure as well.

Yeah. I figured the annoying part would be tweaking the qr code so that the camera is actually able to get a lot of decodable frames -- if the details inside the qr code are tiny, and there's blur on top, you may not be able to decode them properly.

Work on? Or create? Because I think those are slightly different answers.

I meant create actually :). I had an idea for a side-project of this type using audio only, as a way to learn new things. One day I thought, hey, why not do video instead, and thought that QR codes could be a good starting point. Lo and behold, after some googling I found out that you already have this amazing project.

The reason I asked about your background is that I wanted to know how feasible it is to learn just enough of the prerequisites to be able to build & understand something like this. In my case, I have not done any image processing before, or used opencv or used ecc :), so it looks like a very interesting project.

P.S. what was the process behind creating the shapes you use in the tiles?

mvelbaum avatar Jun 06 '25 07:06 mvelbaum

What do you mean by "in between" frames? Just frames that happened to be blurry between two frames that decode properly, or some video codec thing that does frame interpolation? I thought intuitively that shaking the QR code would make the frames blurry, but I guess I am wrong.

Cameras tend to be pretty bad at capturing video off a screen, and that's what we're doing here (with the added complexity that we need to process in real time). I'm not sure about the root cause -- it's something I'd like to understand better at some point -- but one of the results is that the previous frame tends to persist as a sort of afterimage as the next one appears. There are physical constraints (exposure time, etc) that could cause this, as well as computing constraints (compression, etc). But from the perspective of a simple app, it doesn't matter too much whether it's a physical constraint or an API constraint. Since we get this sort of "in between" frame smearing effect as we go from one frame to the next, those in-between frames will almost always fail to decode.

Hence the shaking, so we fail faster.

Your intuition is probably picturing the camera itself shaking -- but in this case the rest of the frame is stationary, so the image "moving" within that context doesn't influence the focus. There's no penalty; it's just doing something extra.

Yeah. I figured the annoying part would be tweaking the qr code so that the camera is actually able to get a lot of decodable frames -- if the details inside the qr code are tiny, and there's blur on top, you may not be able to decode them properly.

So, this is a function of the design. The barcode resolution -- how many 8x8 tiles fit inside the grid -- is calibrated for ~1080p (in practice, ~900p is fine). One of the tweaks I've been working on is a version of cimbar that works on 720p cameras.
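
As a back-of-envelope (every number here is assumed, including the 1024px grid -- none of this is pulled from the codebase):

```python
def camera_px_per_tile(capture_px, grid_px=1024, tile_px=8):
    """How many camera pixels land on one 8x8 tile if the code fills
    the frame height? Fewer pixels per tile means blur hurts more."""
    scale = capture_px / grid_px        # camera px per barcode px
    return (tile_px * scale) ** 2       # tile area on the sensor

for capture in (1080, 900, 720):
    print(f"{capture}p: ~{camera_px_per_tile(capture):.0f} px per tile")
```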

It was interesting to find a configuration that worked, since it requires some trial and error (and intuition). But I'd say the majority of the work has been in selecting tilesets/colorsets, eking out performance improvements, and refining the Matryoshka-doll-style design of how all the layers (symbols, colors, 2 levels of ecc + compression) interact.
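
To give a toy picture of that nesting (NOT the real pipeline -- zlib stands in for zstd, and the "ecc" is a trivial repetition code; the point is just that each layer wraps the next, and the decoder peels them off in reverse):

```python
import zlib  # stand-in for zstd, to keep the toy self-contained

def add_ecc(b: bytes) -> bytes:     # toy "ecc": repeat each byte 3x
    return bytes(x for x in b for _ in range(3))

def fix_errors(b: bytes) -> bytes:  # majority vote per triple
    out = bytearray()
    for i in range(0, len(b), 3):
        trio = b[i:i + 3]
        out.append(max(set(trio), key=trio.count))
    return bytes(out)

def encode(data: bytes) -> str:     # compress, wrap in ecc, "draw" as hex
    return add_ecc(zlib.compress(data)).hex()

def decode(stream: str) -> bytes:   # reverse order: unhex, fix, decompress
    return zlib.decompress(fix_errors(bytes.fromhex(stream)))

assert decode(encode(b"matryoshka")) == b"matryoshka"
```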

P.S. what was the process behind creating the the shapes you use in the tiles?

Trial and error. There's some discussion on tileset discovery in this issue: https://github.com/sz3/libcimbar/issues/69

sz3 avatar Jun 07 '25 06:06 sz3