Tasks-Vision Image Segmentation Selfie Landscape GPU Output Misaligned and Poor Quality
Have I written custom code (as opposed to using a stock example script provided in MediaPipe)
None
OS Platform and Distribution
Windows 11
Mobile device if the issue happens on mobile device
No response
Browser and version if the issue happens on browser
No response
Programming Language and version
Javascript (Chrome Version 121.0.6167.161 (Official Build) (64-bit))
MediaPipe version
0.10.9
Bazel version
No response
Solution
Tasks Vision Image Segmentation - Selfie Segmentation
Android Studio, NDK, SDK versions (if issue is related to building in Android environment)
No response
Xcode & Tulsi version (if issue is related to building for iOS)
No response
Describe the actual behavior
Output is vertically offset by about one texel (1/144) in common scenarios
Describe the expected behaviour
Output should be correctly positioned
Standalone code/steps you may have used to try to get what you need
You can see this behavior in the demo itself:
https://mediapipe-studio.webapps.google.com/demo/image_segmenter
This seems to primarily affect the landscape model, which has its output about one texel below where it should be. For me this happens only when GPU inference is turned on. Interestingly, if I am using my Intel iGPU instead of my 3090, the GPU inference option does yield correctly positioned output. Even if I shift the output back up one pixel, the quality is better but still rough, so it might not be as simple as a one-texel shift. The legacy solution does not have this problem regardless of CPU/GPU, and as a side note, the new solution also has significantly higher CPU, GPU, and RAM consumption.
Other info / Complete Logs
No response
For the side-note: MediaPipe Studio is a complicated environment (there's a bunch of extra stuff going on there), so I believe that if you try profiling a standalone demo like the CodePen instead (https://mediapipe-studio.webapps.google.com/demo/image_segmenter), you will get much better measurements on CPU/GPU/RAM consumption.
Also, even the CodePen demos can be made quite a bit faster and more efficient if the visualization rendering is performed on the GPU as well; more information can be found in #4630, and it looks like there's at least one sample that now uses this more efficient path: https://github.com/googlesamples/mediapipe/pull/212.
Please let us know if trying either/both of the above doesn't improve things!
I'd also be curious to hear whether the output mis-positioning you're experiencing is also fixed by changing the demo environment (that would tell us whether the issue is just visualization in Studio, or whether the problem is possibly something else).
Because there is an overlay rather than a background swap, the misalignment is a lot easier to see in the main demo (https://mediapipe-studio.webapps.google.com/demo/image_segmenter), but it appears to still be present in the tutorial background segmenter (https://github.com/googlesamples/mediapipe/tree/main/tutorials/background_segmenter). The easiest way to tell is to position your arm/hand so that it is detected as foreground; the top gets cut off a bit and the bottom includes a bit extra. Once again, I only get this on my 3090, not when I am using the CPU or when I have my iGPU selected as the GPU.
Tutorial Segmenter GPU Enabled (RTX 3090):
Tutorial Segmenter CPU Enabled:
With a green screen, the misalignment in the GPU version becomes much clearer, but in most situations the difference is easy to miss and just looks like slightly worse segmentation.
With regard to performance, even in the tutorial demo I am seeing about 2x more CPU and GPU consumption than the legacy implementation, and the memory consumption is closer but still worse. I am curious how it works under the covers and why it needs so much CPU; my naive mental model is that a 256x144 image would just be sent to the GPU, run through several neural network layers, and come back out as the mask, which should barely use the CPU at all.
The misalignment is interesting; since it's GPU-specific, it makes me wonder which operations on your GPU are producing these differences (for example, whether the issue is model-specific as well, or whether it only occurs for the confidence mask versus the category mask). Narrowing that down as much as possible sounds like a good next step.
As for the performance, if everything is being done GPU-side, then the new API should be at least as efficient, if not more so. However, the new API does have a lot of additional options that the previous one didn't have, so it would be good to verify that everything else is the same between the old and new API tests. In particular (see the configuration sketch after this list):
(a) the model being used is the same;
(b) only the confidence mask is being produced (the old API didn't have category masks; those are a more complicated optional feature);
(c) the ML inference is using the GPU in both cases;
(d) no CPU postprocessing is accidentally occurring with the new API when the results are drawn to screen or when the mask is handled;
(e) the demo itself is comparable (the "background_segmenter" demo performs a blur and does a bunch of imageBitmap conversions, for example).
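For reference, here is a minimal sketch of an "everything matched" configuration for the new Tasks API (the WASM CDN path and model filename are placeholders; swap them for whatever the legacy test uses):

```js
import { FilesetResolver, ImageSegmenter } from "@mediapipe/tasks-vision";

const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm"
);

const imageSegmenter = await ImageSegmenter.createFromOptions(vision, {
  baseOptions: {
    // (a) point this at the same landscape model file the legacy pipeline uses
    modelAssetPath: "selfie_segmenter_landscape.tflite",
    delegate: "GPU",             // (c) GPU inference in both tests
  },
  runningMode: "VIDEO",
  outputConfidenceMasks: true,   // (b) confidence mask only...
  outputCategoryMasks: false,    // ...no category mask (not in the old API)
});
```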
I suspect that at least one, if not more, of these cases (a)-(e) is causing the difference in work, since otherwise the pipeline should be almost entirely GPU, it should be very efficient, and there should be almost no CPU processing being performed (except waiting for GPU processing to finish, which would likely show up in a profile as "readPixels" calls taking a long time).
At the very least, looking at the code makes me think that (e) would certainly cause some large performance differences. In particular, I see that the "background_segmenter" demo doesn't check whether video.currentTime has advanced before processing video frames (in order to prevent duplicate frame rendering), so it is very likely just doing 2x or 3x the amount of work for no reason by processing the same frame two or three times in a row. So that's the first thing I'd try to fix; a sketch of the guard follows below.
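Something along these lines should work (hypothetical names: `video` is the source `<video>` element, `imageSegmenter` the task instance, and `handleResult` whatever result callback the demo already uses):

```js
let lastVideoTime = -1;

function renderLoop() {
  // Only run segmentation when the video has actually advanced to a new
  // frame; otherwise a 60+ Hz requestAnimationFrame loop segments the same
  // ~30 fps frame two or three times in a row.
  if (video.currentTime !== lastVideoTime) {
    lastVideoTime = video.currentTime;
    imageSegmenter.segmentForVideo(video, performance.now(), handleResult);
  }
  requestAnimationFrame(renderLoop);
}

requestAnimationFrame(renderLoop);
```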
To make the comparison simpler and more direct, if the video.currentTime fix isn't enough, I think I'd then personally just scrap all the existing demos and make a much simpler test: force everything to be the same (resolution, model, confidence masks only, GPU only, etc.), then take the mask texture (as a WebGLTexture) and render it directly into the canvas myself with the same shader code for final output. Otherwise, there are just too many differences between all these demos/examples, and hunting down the deltas one by one would take a lot longer than making a simple from-scratch comparison that tests the two APIs in an apples-to-apples way.
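As a sketch of what the GPU-only output path could look like with the new API (this assumes the segmenter shares the output canvas's WebGL2 context, e.g. via the `canvas` option at creation time, and `drawMaskTexture` is a hypothetical helper that draws a fullscreen textured quad):

```js
imageSegmenter.segmentForVideo(video, performance.now(), (result) => {
  const mask = result.confidenceMasks[0];
  if (mask.hasWebGLTexture()) {
    // Keep the mask on the GPU; no readPixels / CPU copy is needed.
    const maskTexture = mask.getAsWebGLTexture();
    drawMaskTexture(gl, maskTexture); // hypothetical fullscreen-quad pass
  }
  // Masks passed to the callback are owned by the task and are released
  // once the callback returns, so don't hold a reference to them.
});
```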
Also, if you capture screenshots of the performance profiling traces when running these demos, that can sometimes be enough to see where things are taking too long (or using CPU unexpectedly). So that's always a good thing to share for reference, since it's quick to capture, just in case anything noticeably off can be found in those.
I did some testing on different hardware for the texel offset:
- RTX 3090 (original test case): afflicted
- RTX 3080: afflicted
- GTX 1070: afflicted
- AMD R9 Fury X: correct behavior
So it seems to affect NVIDIA specifically, even across generations.
As for the performance, I made some bare-minimum tests locally where I just loop the segmentation on a fixed image without touching the output, to get as close to an apples-to-apples comparison as possible rather than comparing two differently constructed demos.
The legacy solution had the browser consuming 3.6% of the CPU and 33.3% of the GPU, yielding about 180 FPS. The Tasks Vision solution used 1.9% of the CPU and 33.4% of the GPU, yielding about 190 FPS.
The new solution shows a clear benefit in CPU consumption. The GPU usage is pretty close, though Task Manager doesn't seem great at reporting exact GPU consumption. I guess the performance of the new solution is fine. I'm not sure whether there are any optimizations that would invalidate a fixed-image test, but I did provide an incrementing number for the timestamp.
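A loop along these lines is a hypothetical reconstruction of the kind of test described above (`testImage` is a pre-decoded fixed image, and the timestamp is just an incrementing counter since the source never changes):

```js
let timestampMs = 0;
let segmentationsPerSecond = 0;

function loop() {
  // Segment the same decoded image over and over; the callback does nothing
  // so neither API is charged for any visualization work.
  imageSegmenter.segmentForVideo(testImage, timestampMs++, () => {
    segmentationsPerSecond++;
  });
  setTimeout(loop, 0); // loop as fast as the event loop allows, not rAF-capped
}

setInterval(() => {
  console.log(`${segmentationsPerSecond} segmentations in the last second`);
  segmentationsPerSecond = 0;
}, 1000);

loop();
```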
Hi @Singulariteehee,
Could you please confirm whether you are still seeking a resolution to this issue, or if the matter has been resolved from your end?
Thank you!!
This issue has been marked stale because it has had no activity for the past 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.
I tried again with the current latest version, 0.10.12, and the problem is still there. I forced the task to allow regionOfInterest and then offset it by 1/144 vertically when detecting NVIDIA as the GPU, which partially solves the problem, but it further reveals a pattern of janky lines: roughly one in eight horizontal lines is messed up, creating jagged shapes in the results. When I remove the regionOfInterest hack, I can still see the one-in-eight horizontal line pattern, but it is just less noticeable because the whole output is offset. I am limited in how much I can investigate because the WASM source and unobfuscated JavaScript are both still unavailable.
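For anyone trying to reproduce the workaround, the GPU-vendor detection side of it can be done with something like this (a sketch; the regionOfInterest offset itself requires patching the task's ROI check, as described above):

```js
// Heuristic check for an NVIDIA GPU, used to decide whether to apply the
// 1/144 vertical offset workaround described above.
function isNvidiaGpu() {
  const gl = document.createElement("canvas").getContext("webgl2");
  if (!gl) return false;
  const ext = gl.getExtension("WEBGL_debug_renderer_info");
  const renderer = ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : "";
  return /nvidia|geforce/i.test(renderer);
}
```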