Hey there everyone! As part of my first official programming post, I thought I would start with some details on one of my most recent projects. As a term project for a Computer Vision Systems course, I decided to implement a browser-based facial verification and identification system. The goal of this project was an adaptive system capable of verifying registered users across multiple software and hardware platforms with minimal requirements. I was able to create this fully portable system, capable of running in desktop or mobile Chrome-based browsers, using the following tools:
- TensorFlow.js libraries
- face-api.js library
- ReactJS Framework
- Heroku Web Services
- Django REST Framework
Single Shot MultiBox Detector (SSD)
The SSD network performs object detection and facial classification in one forward pass. Built on top of the standard VGG architecture, SSD adds a series of progressively smaller convolutional layers to extract features at varying scales. Each of these auxiliary layers is convolved with a small m × n × p kernel filter to produce a bounding-box offset and a score for each of the predefined categories. Default bounding boxes are manually predefined and assigned to each feature-map cell, and the filters predict offsets and scores relative to those defaults. Since the filters are applied per class and per default box, at every location in every feature map, this process generates a large number of candidate bounding boxes. SSD uses non-maximum suppression to prune the candidates based on thresholds for the confidence scores and Intersection over Union (IoU). The architecture accomplishes both detection and classification of objects at various scales in a single network pass while still providing fast, accurate inference.
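The pruning step above is straightforward to sketch in plain JavaScript. Here boxes are simple `{ x, y, w, h, score }` objects and the function names are illustrative, not taken from any particular library:

```javascript
// Intersection over Union (IoU) of two axis-aligned boxes.
function iou(a, b) {
  const x1 = Math.max(a.x, b.x);
  const y1 = Math.max(a.y, b.y);
  const x2 = Math.min(a.x + a.w, b.x + b.w);
  const y2 = Math.min(a.y + a.h, b.y + b.h);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const union = a.w * a.h + b.w * b.h - inter;
  return union > 0 ? inter / union : 0;
}

// Non-maximum suppression: drop low-confidence boxes, then greedily keep the
// highest-scoring box and discard any remaining box that overlaps it too much.
function nonMaxSuppression(boxes, scoreThresh = 0.5, iouThresh = 0.5) {
  const sorted = boxes
    .filter(b => b.score >= scoreThresh)
    .sort((a, b) => b.score - a.score);
  const kept = [];
  for (const box of sorted) {
    if (kept.every(k => iou(k, box) < iouThresh)) kept.push(box);
  }
  return kept;
}
```

The greedy O(n²) loop here is the textbook version; real detectors run the same logic over thousands of candidate boxes per frame.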
You Only Look Once (YOLO) / Tiny Face
Unlike the multi-map, multi-box approach of SSD, the YOLO network reframes facial detection as a single regression problem. One convolutional network predicts bounding boxes and classification scores as a single output vector. Instead of a sliding-window or region-proposal approach, YOLO classifies using the entire image: the input is split into a grid, and each grid cell generates two sets of bounding boxes and class probabilities. This lets YOLO use features from the whole image to predict every bounding box, for all classes, simultaneously. Learning over the small grid forces the network to implicitly encode contextual information about classes and their appearance with fewer parameters. The smaller size of this architecture makes it well suited to the less powerful hardware found in mobile devices, though it comes at a slight cost in accuracy.
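The "single vector result" has a fixed, predictable size: for an S × S grid with B boxes per cell (each box carrying x, y, w, h and a confidence) plus C class scores per cell, the network regresses S·S·(B·5 + C) numbers in one pass. A quick sketch:

```javascript
// Size of YOLO's single output vector for an S x S grid, B boxes per cell,
// and C classes per cell.
function yoloOutputSize(S, B, C) {
  return S * S * (B * 5 + C);
}

// The original YOLO paper's Pascal VOC setup: 7x7 grid, 2 boxes, 20 classes.
console.log(yoloOutputSize(7, 2, 20)); // 1470
```

Tiny variants shrink the backbone rather than this output formula, which is why they trade a little accuracy for a much smaller parameter count.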
Multi-Task Cascaded Convolutional Networks (MTCNN)
Unlike both of the previous networks, MTCNN tackles the task with three separate deep networks run in a cascade. The first stage is a proposal network that identifies candidate regions for bounding boxes. The second stage refines only the regions proposed in the first. The third stage analyzes each refined region to identify five facial landmarks in areas that should contain a face. If all three stages result in a face classification, we can have reasonably high confidence in the result; however, this method requires three networks with a large total number of parameters.
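The value of the cascade is early rejection: most non-face regions are discarded by the cheap first stage and never reach the expensive third one. A minimal sketch of that control flow, with hypothetical stand-in stage functions (not the real P-Net/R-Net/O-Net implementations):

```javascript
// Run a region through a three-stage cascade; any stage can reject early.
// `propose`, `refine`, and `landmarks` are hypothetical stage functions that
// each return an object with a `score` field.
function cascadeDetect(region, { propose, refine, landmarks }, thresh = 0.7) {
  const p = propose(region);            // stage 1: coarse box proposal
  if (p.score < thresh) return null;
  const r = refine(p);                  // stage 2: refine the proposed box
  if (r.score < thresh) return null;
  const o = landmarks(r);               // stage 3: 5-point facial landmarks
  return o.score >= thresh ? o : null;  // a face only if all three stages agree
}
```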
In the end, the network that adapted best to varying image conditions while remaining accurate was SSD. Across different lighting conditions, distances, and angles it performed the best without dropping framerate on a GPU-enabled desktop. Tiny Face was the better fit for lower-end hardware and mobile devices, as it provided good-enough accuracy with fewer parameters and resources.
Once a face has been detected and a bounding box inferred, we can use the landmark detection tool to compute a unique facial descriptor. This is where integration with other web tools and services comes into play. I had little prior experience with many of these tools, but building a whole web app from back end to front end made for a great learning opportunity.
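Verification with these descriptors reduces to a distance check: face-api.js produces a 128-number descriptor per face, and two descriptors are treated as the same person when their Euclidean distance falls under a threshold (0.6 is a commonly used default; tune it for your own data). A minimal sketch, with the user-record shape as an assumption:

```javascript
// Euclidean distance between two descriptors of equal length.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Find the closest registered user under the threshold, or null if the
// face is unknown. `registeredUsers` is assumed to be [{ name, descriptor }].
function bestMatch(queryDescriptor, registeredUsers, threshold = 0.6) {
  let best = null;
  for (const user of registeredUsers) {
    const d = euclideanDistance(queryDescriptor, user.descriptor);
    if (d < threshold && (!best || d < best.distance)) {
      best = { name: user.name, distance: d };
    }
  }
  return best;
}
```

Linear scan is fine at this scale; a handful of registered users means a handful of 128-dimension distance computations per frame.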
I chose React as a front-end framework due to its responsive design and widespread use in industry. React allowed common web components to be shared across the landing, registration, login, and multi-user pages; e.g., the face-detecting webcam component was reused on multiple pages, with its configuration changed via passed-in props. The 'stateful' design of React allows individual components to re-render based on changes in state. For example, fetching the pre-trained weights takes some time, so a loading bar is presented; with React, the loading bar was easily wired to a state change fired by an async handler when the full weight file is retrieved.
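Stripped of the component boilerplate, the loading-bar pattern is a flag that flips when the async fetch resolves. In this sketch `setState` stands in for React's state setter and `loadWeights` for the face-api.js model fetch; both names are illustrative:

```javascript
// Flip a `modelsReady` flag around an async weight fetch; the UI renders the
// loading bar while the flag is false and the webcam view once it is true.
async function withLoadingBar(loadWeights, setState) {
  setState({ modelsReady: false }); // show the loading bar
  await loadWeights();              // e.g. fetch the pre-trained weight files
  setState({ modelsReady: true });  // hide the bar, show the webcam component
}
```

In the actual component this maps onto a `useState` flag toggled inside a `useEffect` on mount, but the control flow is the same.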
Django, MongoDB, Heroku
The resulting vision application lets users register with a username through a guided landmark registration process. The registration page uses 68 facial landmark points (captured from one facial angle) and a login name to create and store a user. To support this functionality, I set up a Django instance with a basic REST configuration running on a dynamically allocated Heroku VM. User-creation calls are picked up by Django, deserialized, and stored in MongoDB (only the username and the 68 points are stored). When a user accesses the landing page, a request is sent to the Heroku instance as a 'wake' signal, since the VM spins down when idle. This gives a low-overhead persistent user DB that integrates easily into the whole application pipeline.
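The front-end half of that flow is just shaping the landmarks into plain JSON for the REST serializer. A sketch of what the registration POST body could look like — the endpoint path and field names here are illustrative, not the project's actual schema:

```javascript
// Build the JSON body for a user-creation call: a username plus the 68
// landmark points flattened to [x, y] pairs so Django can deserialize them.
function buildUserPayload(username, landmarkPoints) {
  if (landmarkPoints.length !== 68) {
    throw new Error('expected 68 facial landmark points');
  }
  return {
    username,
    landmarks: landmarkPoints.map(p => [p.x, p.y]),
  };
}

// Usage from the browser (hypothetical endpoint):
// fetch('https://<app>.herokuapp.com/api/users/', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildUserPayload(name, points)),
// });
```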
Overall, this was a fun and interesting project with many learning opportunities. I plan on taking some of these pieces and applying them to other projects currently in progress. If you found this project interesting, or have any questions, comment down below!