Batching in `.extract_faces` to improve performance and fully utilize the GPU
Tickets
- https://github.com/serengil/deepface/pull/1433
- https://github.com/serengil/deepface/issues/1101
- https://github.com/serengil/deepface/issues/1434
What has been done
With this PR, `.extract_faces` is able to accept a list of images.
How to test
make lint && make test
Benchmarking on detecting 50 faces:
- yolov11n: batch size 20 is 59.27% faster than batch size 1.
- yolov11s: batch size 20 is 29.00% faster than batch size 1.
- yolov11m: batch size 20 is 31.73% faster than batch size 1.
- yolov8: batch size 20 is 12.68% faster than batch size 1.
Do you have a branch in your fork that currently combines all the optimizations you've submitted? I'd like to start using them while the approval process is ongoing.
What's been the total speedup you've been able to see?
I do. You can check https://github.com/galthran-wq/deepface/tree/master-enhanced
it combines these two PRs with some other small modifications:
- `.represent` uses batched detector inference (in https://github.com/serengil/deepface/pull/1433 it only does batched embedding, because batched detection is not yet implemented).
- `.represent` returns a list of lists of dicts if a batch of images is passed. This is necessary to recover which image each resulting face corresponds to. It might be a good idea to include this change in the PR as well. You can check the test in the fork.
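To make the return shape concrete, here is a small illustrative sketch (plain Python, no deepface required; the field names are representative, not verbatim from the library): with batched input, the result is described as one inner list of face dicts per input image, so the face-to-image correspondence is preserved even when some images contain no faces.

```python
# Hypothetical batched result for three input images, shaped as the
# discussion describes: a list (per image) of lists (per face) of dicts.
batch_result = [
    [{"embedding": [0.1, 0.2], "facial_area": {"x": 0, "y": 0, "w": 5, "h": 5}}],  # image 0: 1 face
    [],                                                                            # image 1: no faces
    [{"embedding": [0.3, 0.4], "facial_area": {"x": 1, "y": 1, "w": 4, "h": 4}},
     {"embedding": [0.5, 0.6], "facial_area": {"x": 2, "y": 2, "w": 3, "h": 3}}],  # image 2: 2 faces
]

# Recovering which image each face belongs to is just an index walk.
for img_idx, faces in enumerate(batch_result):
    print(f"image {img_idx}: {len(faces)} face(s)")
```

A flat list of dicts could not represent the empty second image without losing alignment, which is why the nested shape matters.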
Not all of the detectors (both in this PR and in the fork) currently implement batching. In particular, YOLO does, and I've found it to be optimal in terms of performance and inference speed. The only problem is installing both torch and tensorflow with GPU support, but I've managed to do that.
All in all, with the combination of yolov11m and Facenet, both using the GPU, and batch size 100 (the largest I could fit on a 4090), I am seeing around a 15x speed boost, but that is highly dependent on the input images and the GPU (especially its memory size). I've also had a quick peek and it seems the performance on CPU is improved as well.
@serengil FYI I would be happy to contribute the aforementioned modifications if we have progress on the PRs.
I will review this PR this week i hope
Seems this breaks the unit tests. Must be sorted.
should be good now
Nope, still failing.
You implemented OpenCv, Ssd, Yolo, MtCnn and RetinaFace to accept list inputs
What if I send list to YuNet, MediaPipe, FastMtCnn, Dlib or CenterFace?
I assume an exception will be thrown, but users should see a meaningful message.
@galthran-wq you are pushing too many things, would you please inform me when it is ready.
it is ready now, I usually push when it's done
I've just fixed what you've noted:
- added pseudo-batching (for loop) for other models
- support for np array batched input (dim=4)
- changed `extract_faces` to return a list of lists on batched input
- a couple more tests
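The "np array batched input (dim=4)" support above can be sketched with a small input-normalization helper. This is a hypothetical function for illustration, not the PR's actual code: it shows how a single image, a list of images, and a 4-dim numpy batch can all be reduced to a list of 3-dim arrays plus a flag recording whether the caller passed a batch.

```python
import numpy as np

# Hypothetical helper: normalize the accepted input forms into
# (list of 3-dim arrays, is_batch flag).
def normalize_batch(img_input):
    if isinstance(img_input, np.ndarray):
        if img_input.ndim == 4:               # batched array: (n, h, w, c)
            return [img for img in img_input], True
        return [img_input], False             # single image: (h, w, c)
    if isinstance(img_input, list):           # list of images
        return list(img_input), True
    raise ValueError("unsupported input type")

imgs, is_batch = normalize_batch(np.zeros((2, 8, 8, 3)))
print(len(imgs), is_batch)  # 2 True
```

The `is_batch` flag is what lets the caller return a list of lists for batched input while keeping the old flat return shape for a single image.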
is this ready? i can still see my comments not resolved.
Almost. Tell me whether you think the batched numpy array test is okay, and let's also settle on whether the current `detect_faces` for non-batched detectors needs refactoring (your last comment).
I plan to:
- add test comments
- fall back to pseudo-batching on `opencv` and `mtcnn`, making sure the rtol stays at 0.01%
- then (maybe) refactor the detectors a bit
To summarize what's changed:
- I've added comments and additional checks to the tests.
- I've made batching on `opencv` and `mtcnn` optional (due to the above issue). To enforce batching, a user can set `ENABLE_OPENCV_BATCH_DETECTION` (or `ENABLE_MTCNN_BATCH_DETECTION`) to `true`.
Unfortunately, this didn't fix the batch extraction case on opencv -- the problem is that it occasionally fails (the predictions seem to have some random behaviour, so the results can differ from run to run!). Note that this has nothing to do with batching, because batching is disabled by default. We might add a separate issue and test to reproduce this.
- I have fixed the special case of a single image in a list input (a batch of size 1). It now indeed returns a list with a single element -- the list of faces detected in that image.
- those detectors that do not implement batching all had repeated logic in `detect_faces`. I have moved this logic to the default implementation in `Detector`. Now those detectors only need to implement `_process_single_image`, and batching is supported through inheritance.
- if a detector implements batching, it overrides `detect_faces` with that logic, just as before.
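The template-method refactor described above can be sketched as follows. This is a minimal toy, not deepface's actual code: class names mirror the discussion, but the stubbed detector returns placeholder dicts instead of running inference.

```python
from typing import Any, List

class Detector:
    # Base class: detect_faces handles both single-image and list input,
    # delegating per-image work to _process_single_image.
    def detect_faces(self, img_input):
        imgs = img_input if isinstance(img_input, list) else [img_input]
        results = [self._process_single_image(img) for img in imgs]
        # List of lists for batched input, a flat list otherwise.
        return results if isinstance(img_input, list) else results[0]

    def _process_single_image(self, img) -> List[dict]:
        raise NotImplementedError

class SsdDetector(Detector):
    # A non-batched detector only implements the per-image hook;
    # a real implementation would run inference here (stubbed).
    def _process_single_image(self, img) -> List[dict]:
        return [{"img": img, "detector": "ssd"}]

detector = SsdDetector()
print(detector.detect_faces("one.jpg"))                # flat list of face dicts
print(len(detector.detect_faces(["a.jpg", "b.jpg"])))  # 2
```

A detector with real batch support would instead override `detect_faces` itself, bypassing the per-image loop.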
If detector is opencv or mtcnn, and input is batch, then we should raise an error. We don't need ENABLE_OPENCV_BATCH_DETECTION or ENABLE_MTCNN_BATCH_DETECTION environment variables.
TBH, I don't want to make this optional for users. They raise issues when something goes wrong.
Please add unit test cases for opencv and mtcnn. When batch input is sent, then exception should be thrown.
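The requested unit test could look roughly like this sketch (stub class and message are hypothetical; the thread later settles on a for-loop fallback instead of raising, so this only illustrates the check being asked for here): batch input to a non-batched detector raises a clear, meaningful exception.

```python
from typing import List

# Hypothetical stub of a non-batched detector raising on batch input.
class OpenCvDetector:
    def detect_faces(self, img) -> List[dict]:
        if isinstance(img, list):
            raise ValueError(
                "opencv detector does not support batch input; "
                "pass images one by one or use a batched detector such as yolo"
            )
        return []  # stub: a real detector would run inference here

detector = OpenCvDetector()
try:
    detector.detect_faces(["img1.jpg", "img2.jpg"])
except ValueError as err:
    print("raised:", err)
```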
I don't like this approach:
those detectors that do not implement batching all had repeating logic in detect_faces. I have moved this logic to the default implementation in Detector. Now, those detectors only need to implement _process_single_image, and batching would be supported by inheritance.
If a detector implements batching, then it overrides detect_faces with this logic, just as before.
All detectors should have `detect_faces` instead of `_process_single_image`, and you should raise an error in that detector if batching is not supported.
Changes:
- reverted `_process_single_image`
- disabled the option to use batch mode for mtcnn and opencv
so now they just use a for loop to process all the images, if a batch is passed. A user gets a warning that there is no actual batching happening.
Do you still think it is a good idea to raise an error?
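The fallback described above can be sketched in a few lines (function and names are hypothetical, for illustration only): non-batched detectors loop over the images one by one and emit a warning that no true batching is happening.

```python
import warnings
from typing import Any, Callable, List

# Hypothetical helper: pseudo-batching fallback with a user-facing warning.
def detect_faces_batch(detector_name: str, imgs: List[Any],
                       detect_one: Callable[[Any], List[dict]]) -> List[List[dict]]:
    if detector_name in ("opencv", "mtcnn"):
        warnings.warn(
            f"{detector_name} does not support batch detection; "
            "falling back to a per-image loop."
        )
    return [detect_one(img) for img in imgs]

# Toy single-image detector standing in for real inference.
results = detect_faces_batch("opencv", [1, 2, 3], lambda img: [img])
print(len(results))  # 3
```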
so now they just use a for loop to process all the images, if a batch is passed. A user gets a warning that there is no actual batching happening.
for loop is okay, don't raise error
As I understand, you built a for loop in each detector. Is that really a batch?
What if we build a for loop in the parent of detectors, and don't change detectors?
I mean batch is really running for yolo
We can create a method, extract faces from batch, in the parent class of detectors, and only customize it for yolo.
In parent class, we basically build a for loop and call detector
In that way, detectors will not change
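The suggested design can be sketched as follows (class and method names are hypothetical, chosen to mirror the wording above): the parent class provides the batch method as a plain for loop over `detect_faces`, and only YOLO customizes it with a real batched forward pass, so the other detectors do not change at all.

```python
from typing import Any, List

class Detector:
    def detect_faces(self, img) -> List[dict]:
        raise NotImplementedError

    # Default batch method in the parent: a plain for loop.
    def extract_faces_from_batch(self, imgs: List[Any]) -> List[List[dict]]:
        return [self.detect_faces(img) for img in imgs]

class MtCnnDetector(Detector):
    # Unchanged single-image detector (inference stubbed).
    def detect_faces(self, img) -> List[dict]:
        return [{"detector": "mtcnn", "img": img}]

class YoloDetector(Detector):
    def detect_faces(self, img) -> List[dict]:
        return self.extract_faces_from_batch([img])[0]

    # Only YOLO overrides the batch method with a real batched pass (stubbed).
    def extract_faces_from_batch(self, imgs: List[Any]) -> List[List[dict]]:
        return [[{"detector": "yolo", "img": img}] for img in imgs]

print(len(MtCnnDetector().extract_faces_from_batch(["a.jpg", "b.jpg"])))  # 2
```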
This has been inactive for a long time; I will close it if there is no objection.