The Experiment
Part I: Controlled Single-Face Variations
Variations include: inverted face, one eye, no eyes, no nose, no mouth, no hair, inverted features, and inverted features combined with an inverted face.
Computer vision uses artificial intelligence (AI) to identify and recognize images, emulating the way the human brain perceives the visual world. This project examines the performance of two facial recognition architectures: Face Transformer, a vision transformer network, and FaceNet, a convolutional neural network (CNN). The goal is to understand whether each model processes faces holistically, the way humans do; in other words, whether the models recognize a face as a whole or as a collection of individual features.
Hypothesis: Transformer models are more likely to holistically recognize and identify faces than convolutional neural networks.
Face Transformer is a model that uses vision transformers to compute a context-dependent representation of an image: the image is split into patches, and self-attention lets each patch incorporate information from its contextual neighbors. The model seems to achieve performance comparable to CNNs with a similar number of parameters.
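To make the patch-attention idea concrete, here is a minimal PyTorch sketch. This is not the actual Face Transformer code: the patch size, embedding dimension, and head count are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the vision-transformer idea behind Face Transformer:
# the image is split into patches, and self-attention lets every patch
# incorporate context from every other patch. Sizes are illustrative,
# not the real Face Transformer hyperparameters.
patch_size, embed_dim, num_heads = 16, 768, 12

# Split a 224x224 face image into 16x16 patches and embed each one.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

image = torch.randn(1, 3, 224, 224)          # one RGB face image
patches = to_patches(image)                  # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768) patch tokens

# Each of the 196 patch tokens attends to all the others, so its output
# representation depends on the whole face, not just its local region.
contextual_tokens, _ = attention(tokens, tokens, tokens)
print(contextual_tokens.shape)               # torch.Size([1, 196, 768])
```

The key property for this experiment is that every output token depends on the entire face, which is what makes the architecture a candidate for holistic processing.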
The FaceNet architecture has three key components: the convolutional backbone, the embedding layer (a fully connected layer that learns a 128-dimensional feature vector from the output of the backbone), and L2 normalization (which constrains embeddings to a unit hypersphere, so Euclidean distance between embeddings can be used to classify a pair of faces as same or different).
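A minimal sketch of FaceNet's embedding head follows, assuming some convolutional backbone that produces a flat feature vector; the backbone itself is elided, and the feature size and distance threshold are illustrative assumptions, not FaceNet's actual values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone_features = 1792  # illustrative backbone output size (assumption)

embedding_layer = nn.Linear(backbone_features, 128)  # 128-D embedding

def embed(features: torch.Tensor) -> torch.Tensor:
    """Project backbone features to a unit-length 128-D embedding."""
    x = embedding_layer(features)
    return F.normalize(x, p=2, dim=-1)  # L2 normalization onto the hypersphere

# Two faces are judged "same" when their embeddings are closer than a
# threshold in Euclidean space (the threshold is tuned on validation data).
face_a = embed(torch.randn(1, backbone_features))
face_b = embed(torch.randn(1, backbone_features))
distance = torch.dist(face_a, face_b)  # Euclidean distance
threshold = 1.1                        # illustrative value, not FaceNet's
print("same person" if distance < threshold else "different people")
```

Because the embeddings are L2-normalized, distances are bounded and directly comparable, which is what lets a single threshold separate same-identity from different-identity pairs.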