
Batching on `.extract_faces` to improve performance and utilize GPU in full

Open galthran-wq opened this issue 10 months ago • 20 comments

Tickets

https://github.com/serengil/deepface/pull/1433
https://github.com/serengil/deepface/issues/1101
https://github.com/serengil/deepface/issues/1434

What has been done

With this PR, .extract_faces is able to accept a list of images
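The return-shape contract for list input can be sketched with a mock (the function below is illustrative, not deepface code; the dict keys "face", "facial_area", and "confidence" mirror deepface's single-image face dicts):

```python
from typing import Any, Dict, List

# Mock of the batched contract: a list of input images yields one list of
# face dicts per image, each dict shaped like deepface's usual output.
def extract_faces_batched_mock(imgs: List[str]) -> List[List[Dict[str, Any]]]:
    # One placeholder face per image; a real detector may return zero or many.
    return [[{"face": None,
              "facial_area": {"x": 0, "y": 0, "w": 1, "h": 1},
              "confidence": 0.99}] for _ in imgs]

batched = extract_faces_batched_mock(["img1.jpg", "img2.jpg"])
assert len(batched) == 2          # one entry per input image
assert isinstance(batched[0], list)
assert "facial_area" in batched[0][0]
```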

How to test

make lint && make test

Benchmarking on detecting 50 faces (benchmark chart attached):

  • For yolov11n, batch size 20 is 59.27% faster than batch size 1.
  • For yolov11s, batch size 20 is 29.00% faster than batch size 1.
  • For yolov11m, batch size 20 is 31.73% faster than batch size 1.
  • For yolov8, batch size 20 is 12.68% faster than batch size 1.
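Assuming "X% faster" here means the reduction in wall-clock time relative to batch size 1, the figures can be reproduced like this (the timings below are illustrative placeholders, not the actual measurements):

```python
# Percentage reduction in runtime relative to a baseline timing.
def pct_faster(t_baseline: float, t_batched: float) -> float:
    return (t_baseline - t_batched) / t_baseline * 100.0

# e.g. if batch size 1 takes 10.0 s and batch size 20 takes 4.07 s:
print(round(pct_faster(10.0, 4.07), 2))  # 59.3
```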

galthran-wq avatar Feb 13 '25 13:02 galthran-wq

Do you have a branch in your fork that currently combines all the optimizations you've submitted? I'd like to start using them while the approval process is ongoing.

What's the total speedup you've been able to see?

skyler14 avatar Feb 16 '25 08:02 skyler14

I do. You can check https://github.com/galthran-wq/deepface/tree/master-enhanced

It combines these two PRs with some other small modifications:

  • .represent uses batched detector inference (in https://github.com/serengil/deepface/pull/1433 only the embedding step is batched, because batched detection is not yet implemented)
  • .represent returns a list of lists of dicts if a batch of images is passed. This is necessary to recover which image each resulting face came from. It might be a good idea to include this change in the PR as well. You can check the test in the fork.
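The need for the list-of-lists return can be seen with a small regrouping helper (hypothetical, not deepface code): if a batched call returned a single flat list of faces, the caller would need per-image counts to reassign faces to images, which the nested structure makes unnecessary.

```python
from typing import Any, Dict, List

# Regroup a flat run of face dicts back into per-image lists, given how
# many faces each image produced. This is exactly the bookkeeping the
# list-of-lists return shape avoids.
def group_faces_by_image(flat: List[Dict[str, Any]],
                         counts: List[int]) -> List[List[Dict[str, Any]]]:
    grouped, start = [], 0
    for n in counts:
        grouped.append(flat[start:start + n])
        start += n
    return grouped

faces = [{"id": "a0"}, {"id": "a1"}, {"id": "b0"}]
print(group_faces_by_image(faces, [2, 1]))
# [[{'id': 'a0'}, {'id': 'a1'}], [{'id': 'b0'}]]
```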

Not all of the detectors (in this PR or in the fork) currently implement batching; YOLO in particular does, and I've found it to be optimal in terms of performance and inference speed. The only problem is installing both torch and tensorflow with GPU support, but I've managed to do that.

All in all, with the combination of yolov11m and Facenet, both using the GPU, and batch size 100 (the largest I could fit on a 4090), I am seeing around a 15x speed boost, but that is highly dependent on the input images and on the GPU (especially its memory size). I've also had a quick peek, and it seems that performance on the CPU is improved as well.

@serengil FYI I would be happy to contribute the aforementioned modifications if we have progress on the PRs.

galthran-wq avatar Feb 16 '25 09:02 galthran-wq

I will review this PR this week i hope

serengil avatar Feb 16 '25 09:02 serengil

Seems this breaks the unit tests. Must be sorted.

serengil avatar Feb 16 '25 12:02 serengil

should be good now

galthran-wq avatar Feb 16 '25 14:02 galthran-wq

Nope, still failing.

serengil avatar Feb 16 '25 15:02 serengil

You implemented OpenCv, Ssd, Yolo, MtCnn and RetinaFace to accept list inputs

What if I send list to YuNet, MediaPipe, FastMtCnn, Dlib or CenterFace?

I assume an exception will be thrown, but users should see a meaningful message.

serengil avatar Feb 17 '25 13:02 serengil

@galthran-wq you are pushing too many things, would you please inform me when it is ready.

serengil avatar Feb 18 '25 10:02 serengil

@galthran-wq you are pushing too many things, would you please inform me when it is ready.

It is ready now; I usually push when it's done.

I've just fixed what you've noted:

  • added pseudo-batching (for loop) for other models
  • support for np array batched input (dim=4)
  • changed extract_faces to return a list of lists on batched input
  • couple more tests
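The "np array batched input (dim=4)" point above refers to distinguishing a single image, shaped (H, W, C), from a batch, shaped (N, H, W, C). A sketch of such a check (the function name is illustrative, not deepface's actual helper):

```python
import numpy as np

# A single image array has 3 dimensions (H, W, C); a batch has 4 (N, H, W, C).
# Lists of images are always treated as batches.
def is_batched(img) -> bool:
    if isinstance(img, list):
        return True
    if isinstance(img, np.ndarray):
        return img.ndim == 4
    return False

single = np.zeros((480, 640, 3), dtype=np.uint8)
batch = np.zeros((20, 480, 640, 3), dtype=np.uint8)
assert not is_batched(single)
assert is_batched(batch)
```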

galthran-wq avatar Feb 18 '25 10:02 galthran-wq

Is this ready? I can still see unresolved comments.

serengil avatar Feb 20 '25 17:02 serengil

Is this ready? I can still see unresolved comments.

Almost. Tell me if you think the batched numpy array test is okay, and let's also settle on whether the current detect_faces needs refactoring for non-batched detectors (your last comment).

I plan to:

  • add test comments
  • fall back to pseudo-batching on opencv and mtcnn, and make sure rtol then stays within 0.01%
  • (maybe) refactor detectors a bit

galthran-wq avatar Feb 20 '25 18:02 galthran-wq

To summarize what's changed:

  • I've added comments and additional checks to the tests.
  • I've made batching on opencv and mtcnn optional (due to the above issue). To force batching, a user can set ENABLE_OPENCV_BATCH_DETECTION (or ENABLE_MTCNN_BATCH_DETECTION) to true.

Unfortunately, this didn't fix the batch extraction case on opencv -- the problem is that it occasionally fails (the predictions seem to have some random behaviour, so results may differ from run to run!). Note that this has nothing to do with batching, which is disabled by default. We might add a separate issue and test to reproduce this.

  • I have fixed the special case of a single image in a list input (a batch of size 1). It now indeed returns a list with a single element: the list of faces detected in that image.
  • Those detectors that do not implement batching all had repeated logic in detect_faces. I have moved this logic into the default implementation in Detector. Now those detectors only need to implement _process_single_image, and batching is supported through inheritance.
  • If a detector implements batching, then it overrides detect_faces with this logic, just as before.
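The inheritance scheme described above (later reverted in the PR) can be sketched as follows. The class and method names mirror the comment; everything else is illustrative:

```python
from abc import ABC, abstractmethod
from typing import Any, List, Union

class Detector(ABC):
    @abstractmethod
    def _process_single_image(self, img: Any) -> List[Any]:
        """Detect faces in a single image."""

    def detect_faces(self, imgs: Union[Any, List[Any]]) -> Any:
        # Default implementation: pseudo-batching via a for loop.
        # Detectors with real batching override this method instead.
        if isinstance(imgs, list):
            return [self._process_single_image(img) for img in imgs]
        return self._process_single_image(imgs)

class DummyDetector(Detector):
    def _process_single_image(self, img):
        return [f"face-in-{img}"]

d = DummyDetector()
assert d.detect_faces("a.jpg") == ["face-in-a.jpg"]
assert d.detect_faces(["a.jpg", "b.jpg"]) == [["face-in-a.jpg"],
                                              ["face-in-b.jpg"]]
```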

galthran-wq avatar Feb 23 '25 14:02 galthran-wq

If detector is opencv or mtcnn, and input is batch, then we should raise an error. We don't need ENABLE_OPENCV_BATCH_DETECTION or ENABLE_MTCNN_BATCH_DETECTION environment variables.

TBH, I don't want to make this optional for users. They raise issues when something is wrong.

serengil avatar Feb 24 '25 15:02 serengil

Please add unit test cases for opencv and mtcnn. When batch input is sent, then exception should be thrown.

serengil avatar Feb 24 '25 15:02 serengil

I don't like this approach:

those detectors that do no implement batching all had repeating logic in detect_faces. I have moved this logic to the default implementation in Detector. Now, those detectors only need to implement _process_single_image, and batching would be supported by inheritance.
If a detector implements batching, then it overrides detect_faces with this logic, just as before.

all detectors should have detect_faces instead of _process_single_image, and you should raise an error in that detector if it is not supported.

serengil avatar Feb 24 '25 15:02 serengil

Changes:

  1. reverted _process_single_image
  2. disabled option to use batch mode for mtcnn and opencv

So now they just use a for loop to process the images when a batch is passed. The user gets a warning that no actual batching is happening.

Do you still think it is a good idea to raise an error?

galthran-wq avatar Mar 08 '25 15:03 galthran-wq

so now they just use a for loop to process all the images, if a batch is passed. A user gets a warning that there is no actual batching happening.

for loop is okay, don't raise error

serengil avatar Mar 09 '25 13:03 serengil

As I understand, you built a for loop in each detector. Is that really a batch?

What if we build a for loop in the parent of detectors, and don't change detectors?

serengil avatar Mar 09 '25 14:03 serengil

I mean, real batching is only running for yolo.

We can create a method extract_faces_from_batch in the parent class of detectors, and customize it only for yolo.

In the parent class, we basically build a for loop and call the detector.

In that way, detectors will not change
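The design proposed above can be sketched like this. The method name extract_faces_from_batch comes from the comment; the classes and return values are illustrative:

```python
from typing import Any, List

class Detector:
    def detect_faces(self, img: Any) -> List[Any]:
        raise NotImplementedError  # each concrete detector implements this

    def extract_faces_from_batch(self, imgs: List[Any]) -> List[List[Any]]:
        # Generic default: no real batching, just the single-image path
        # applied per image. Existing detectors need no changes.
        return [self.detect_faces(img) for img in imgs]

class OpenCvDetector(Detector):
    def detect_faces(self, img):
        return [f"face:{img}"]  # stand-in for real single-image detection

class YoloDetector(Detector):
    def detect_faces(self, img):
        return [f"face:{img}"]

    def extract_faces_from_batch(self, imgs):
        # Only YOLO overrides the default: a real implementation would run
        # the model once over the whole batch here.
        return [[f"face:{img}"] for img in imgs]

assert OpenCvDetector().extract_faces_from_batch(["x", "y"]) == \
    [["face:x"], ["face:y"]]
assert YoloDetector().extract_faces_from_batch(["x"]) == [["face:x"]]
```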

serengil avatar Mar 09 '25 15:03 serengil

This has been inactive for a long time; I will close it if there is no objection.

serengil avatar Jun 06 '25 13:06 serengil