create face detection demo app
We want an application that detects human faces in a live video stream and can show the result in a browser-based frontend.
The application is microservice-based, and I see three components as part of this:
- camera capture component - This runs on a device to which a camera is connected. It captures a video stream / stream of images from the camera and publishes these as events to which consumers can subscribe (see the sketch after this list).
- face detection component - This receives a stream of images and emits the positions of human faces found in that stream.
- video display component - This receives a stream of images and may additionally receive a stream of human face positions for that stream. It displays the stream as video and may mark the face positions within it.
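To make the camera capture component concrete, here is a minimal sketch. It assumes WAMP/Autobahn as the transport and uses a hypothetical topic name `com.example.camera.frames`, a local router at `ws://localhost:8080/ws`, JPEG-encoded frames and a fixed frame rate; all of these are placeholders, not a fixed design:

```python
import asyncio
import time

import cv2
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner


class CameraCapture(ApplicationSession):
    """Captures frames from an attached camera and publishes them as events."""

    async def onJoin(self, details):
        cap = cv2.VideoCapture(0)  # first attached camera
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # tag each frame with a timecode so downstream components
            # (e.g. face detection) can correlate their derived data
            timecode = time.time()
            _, jpeg = cv2.imencode('.jpg', frame)
            self.publish('com.example.camera.frames', jpeg.tobytes(), timecode)
            await asyncio.sleep(1 / 15)  # ~15 fps, also yields to the event loop


if __name__ == '__main__':
    ApplicationRunner('ws://localhost:8080/ws', 'realm1').run(CameraCapture)
```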
Interaction:
- The display and processing in our demo are triggered from the video display component, e.g. by a user selecting a camera from a menu and additionally picking whether to display just the video or the video with marked faces (potentially also: how to mark the faces).
- Based on this, the camera capture component is instructed to start transmitting images. (In addition to a simple on/off switch, this may also take a topic as an argument, so that a single camera can serve consumers across multiple apps.)
- If face detection is desired, the face detection component is given the topic of the camera capture component, as well as the topic to which to publish the face detection data.
- In the video display component, the video data from the camera capture component is displayed, and markings of face positions are overlaid if desired.
With this, we can use the video display component twice in our demo: once to show the raw image stream, and once to show the stream with the face positions marked.
My initial (naive) assumption regarding coordination between the two data streams is that this happens via a timecode generated by the camera capture component, which is then also attached to the face detection data stream. The video display component can then cache either stream until the required pairs are present.
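To illustrate that naive pairing, here is a sketch of the buffering logic (written in Python for brevity, even though the actual display component would run in the browser); the `render` callback is a placeholder:

```python
class StreamPairer:
    """Buffer video frames and face-position events until a pair with the
    same timecode is available, then hand the pair to the renderer."""

    def __init__(self, render):
        self._render = render   # callback: render(frame, faces)
        self._frames = {}       # timecode -> frame
        self._faces = {}        # timecode -> list of face rectangles

    def on_frame(self, timecode, frame):
        self._frames[timecode] = frame
        self._try_emit(timecode)

    def on_faces(self, timecode, faces):
        self._faces[timecode] = faces
        self._try_emit(timecode)

    def _try_emit(self, timecode):
        # a real implementation would also evict stale entries, e.g. frames
        # for which no detection result ever arrives
        if timecode in self._frames and timecode in self._faces:
            self._render(self._frames.pop(timecode), self._faces.pop(timecode))
```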
@om26er - does the above sound reasonable?
related/kinda superset of https://github.com/crossbario/iotcookbook/issues/27
Let me add some info and background from my side, in particular regarding the ML stuff.
In ML, software is often split into two pieces: a) training/learning and b) detection/run-time.
"face detection component": this would be b) above. It needs to load/access an already trained model and only apply that model to new incoming data, outputting predictions ("Does this picture/video frame contain a human face, yes or no?").
So we should actually have two components for the ML part:
- pattern detection component
- pattern detection training component
For the demo, pattern == human face is perfect, but we should design the components in a way that generalizes to arbitrary patterns (see below).
The specific ML algorithm that we should use for this is "Haar cascades". A good intro can be found here: http://www.willberger.org/cascade-haar-explained/
The output (when using OpenCV for Haar cascades via `cv2.CascadeClassifier`) is exactly one XML file == the trained model.
The OpenCV project provides a bunch of ready-to-use trained models here: https://github.com/opencv/opencv/blob/master/data/haarcascades/
One model provided is `haarcascade_frontalface_default.xml`, which is a Haar cascade model trained to detect human faces.
Being XML, it is verbose, and can be compressed 10x: https://gist.github.com/oberstet/5f91645cb6d4497676b8cca7b83d12e5
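For example, applying such a pre-trained model to a single image takes only a few lines (a sketch; it assumes the XML file and a test image are available locally):

```python
import cv2

# load the pre-trained Haar cascade model (the plain, uncompressed XML)
classifier = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

# detection runs on grayscale images
img = cv2.imread('group_photo.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# returns one (x, y, width, height) rectangle per detected face
faces = classifier.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(faces)
```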
The training component (different from the run-time component) essentially needs to do:
- Input: Two sets of raw images (positive and negative examples)
- Preprocessing/normalization of all images (e.g. size, color/grayscale, etc.)
- Split the preprocessed images into a training set and a test set
- Train a model using the training set
- Test the trained model on the test set (to compute expected precision/recall and such)
- Output: Store the model as an XML file
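A sketch of the preprocessing and splitting steps above (the actual Haar cascade training on the resulting positives/negatives is then done with the OpenCV training tools described in the links further below); directory names, the sample size and the 80/20 ratio are arbitrary placeholders:

```python
import glob
import os
import random

import cv2


def preprocess(src_dir, dst_dir, size=(24, 24)):
    """Normalize all raw images: convert to grayscale and a fixed sample size."""
    os.makedirs(dst_dir, exist_ok=True)
    for path in glob.glob(os.path.join(src_dir, '*.jpg')):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, size)
        cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), img)


def split(paths, test_fraction=0.2):
    """Randomly split the preprocessed images into training and test sets."""
    paths = list(paths)
    random.shuffle(paths)
    cut = int(len(paths) * (1 - test_fraction))
    return paths[:cut], paths[cut:]
```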
The detection run-time component (processing the live video frames) needs to do:
- load the model into a run-time classifier (which does not contain training ability)
- shuffle video frames through the detector
- publish notifications when faces are detected
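Put together, the frame-processing loop of the run-time component could look roughly like this (a sketch; the topic names, router URL and JPEG serialization match the hypothetical choices in the camera capture sketch above):

```python
import cv2
import numpy as np
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner


class FaceDetector(ApplicationSession):
    """Subscribes to camera frames, runs the detector, publishes face positions."""

    async def onJoin(self, details):
        # load the (previously stored/trained) model into the run-time classifier
        self._classifier = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
        await self.subscribe(self._on_frame, 'com.example.camera.frames')

    def _on_frame(self, jpeg_bytes, timecode):
        img = cv2.imdecode(np.frombuffer(jpeg_bytes, dtype=np.uint8), cv2.IMREAD_GRAYSCALE)
        faces = self._classifier.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
        if len(faces):
            # re-publish the detected rectangles, tagged with the frame timecode
            rects = [[int(x), int(y), int(w), int(h)] for (x, y, w, h) in faces]
            self.publish('com.example.faces', rects, timecode)


if __name__ == '__main__':
    ApplicationRunner('ws://localhost:8080/ws', 'realm1').run(FaceDetector)
```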
We could for example have WAMP procs in the ML run-time component:
- `store_model(compressed_xml, label, description) -> UUID` (= SHA fingerprint of XML): store the XML locally on the run-time component's disk
- `load_model(UUID) -> ok|error`: load a previously stored model - only works if no model is currently running
- `run_model() -> ok|error`: start the previously loaded model, which will begin to process live video frames (received from the camera capture component)
- `stop_model() -> ok|error`: stop the currently running model (if any)
- `list_models() -> [{UUID, label, description}]`
And then, e.g., have the ML training component call into `store_model` and so on.
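A rough skeleton of how those procs could be registered in the run-time component (a sketch only; the URIs, the on-disk layout, zlib as compression scheme and the return conventions are all assumptions, and label/description persistence is omitted for brevity):

```python
import glob
import hashlib
import os
import zlib

import cv2
from autobahn.asyncio.wamp import ApplicationSession, ApplicationRunner

MODEL_DIR = './models'


class MLRuntime(ApplicationSession):

    async def onJoin(self, details):
        self._classifier = None   # currently loaded model (if any)
        self._running = False
        for proc, uri in [
            (self.store_model, 'com.example.ml.store_model'),
            (self.load_model, 'com.example.ml.load_model'),
            (self.run_model, 'com.example.ml.run_model'),
            (self.stop_model, 'com.example.ml.stop_model'),
            (self.list_models, 'com.example.ml.list_models'),
        ]:
            await self.register(proc, uri)

    def store_model(self, compressed_xml, label, description):
        # UUID == SHA fingerprint of the (uncompressed) XML
        xml = zlib.decompress(compressed_xml)
        uuid = hashlib.sha256(xml).hexdigest()
        os.makedirs(MODEL_DIR, exist_ok=True)
        with open(os.path.join(MODEL_DIR, uuid + '.xml'), 'wb') as f:
            f.write(xml)
        return uuid

    def load_model(self, uuid):
        if self._running:
            return 'error'   # only allowed while no model is running
        self._classifier = cv2.CascadeClassifier(os.path.join(MODEL_DIR, uuid + '.xml'))
        return 'ok'

    def run_model(self):
        # here the component would also subscribe to the camera frame topic
        self._running = self._classifier is not None
        return 'ok' if self._running else 'error'

    def stop_model(self):
        self._running = False
        return 'ok'

    def list_models(self):
        return [os.path.basename(p)[:-4] for p in glob.glob(os.path.join(MODEL_DIR, '*.xml'))]


if __name__ == '__main__':
    ApplicationRunner('ws://localhost:8080/ws', 'realm1').run(MLRuntime)
```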
some more links:
- https://codeyarns.com/2014/09/01/how-to-train-opencv-cascade-classifier/
- https://codeyarns.com/2014/09/30/tips-for-using-opencv-cascade-classifier/
- http://answers.opencv.org/question/56573/visualizing-learned-features-from-haar-cascade-classifier/
- http://note.sonots.com/SciSoftware/haartraining.html
The reason for the above generalization (patterns instead of only faces, plus separate run-time and training components) is that it makes this much more than just a demo!
E.g. further down the line, we could add a UI that allows an end user to upload and define training sets of arbitrary pictures/images for other applications:
- an industrial user wants to detect "broken parts vs ok parts" in an industrial imaging setup
Because face detection is obviously not something an industrial user would practically do; however, "broken parts vs ok parts" is actually very relevant.
For the initial version, which can use the existing model that OpenCV provides:
- the video capture component and the analysis component are in Python
- video display is in the browser