The Experiment
Part I: Controlled Single-Face Variations
Variations include: inverted face, one eye, no eyes, no nose, no mouth, no hair, inverted features, and inverted features combined with an inverted face.
Computer vision uses artificial intelligence (AI) to identify and recognize images, emulating the way the human brain perceives the visual world. This project examines the performance of two facial recognition architectures: Face Transformer, a vision transformer network, and FaceNet, a convolutional neural network (CNN). The goal is to understand whether each model processes faces holistically, the way humans do; in other words, whether the models recognize a face as a whole or as a collection of individual features.
Hypothesis: Transformer models are more likely to holistically recognize and identify faces than convolutional neural networks.
Face Transformer is a model that uses vision transformers to compute a context-dependent representation of an image: the image is split into patches, and self-attention lets each patch incorporate information from its contextual neighbors. The model seems to achieve performance comparable to CNNs with a similar number of parameters.
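To make the patch-attention idea concrete, here is a minimal PyTorch sketch. This is not the actual Face Transformer code: the patch size, embedding dimension, and head count are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the vision-transformer idea behind Face Transformer:
# the image is split into patches, and self-attention lets every patch
# incorporate context from every other patch. Sizes are illustrative,
# not the real Face Transformer hyperparameters.
patch_size, embed_dim, num_heads = 16, 768, 12

# Split a 224x224 face image into 16x16 patches and embed each one.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

image = torch.randn(1, 3, 224, 224)          # one RGB face image
patches = to_patches(image)                  # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768) patch tokens

# Each of the 196 patch tokens attends to all the others, so its output
# representation depends on the whole face, not just its local region.
contextual_tokens, _ = attention(tokens, tokens, tokens)
print(contextual_tokens.shape)               # torch.Size([1, 196, 768])
```

The key property for this experiment is that every output token depends on the entire face, which is what makes the architecture a candidate for holistic processing.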
The FaceNet architecture has three key components: the convolutional backbone, the embedding layer (a fully connected layer that learns a 128-dimensional feature vector from the output of the backbone), and L2 normalization (which constrains embeddings to a unit hypersphere, so Euclidean distance between embeddings can be used to classify a pair of faces as same or different).
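A minimal sketch of FaceNet's embedding head follows, assuming some convolutional backbone that produces a flat feature vector; the backbone itself is elided, and the feature size and distance threshold are illustrative assumptions, not FaceNet's actual values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone_features = 1792  # illustrative backbone output size (assumption)

embedding_layer = nn.Linear(backbone_features, 128)  # 128-D embedding

def embed(features: torch.Tensor) -> torch.Tensor:
    """Project backbone features to a unit-length 128-D embedding."""
    x = embedding_layer(features)
    return F.normalize(x, p=2, dim=-1)  # L2 normalization onto the hypersphere

# Two faces are judged "same" when their embeddings are closer than a
# threshold in Euclidean space (the threshold is tuned on validation data).
face_a = embed(torch.randn(1, backbone_features))
face_b = embed(torch.randn(1, backbone_features))
distance = torch.dist(face_a, face_b)  # Euclidean distance
threshold = 1.1                        # illustrative value, not FaceNet's
print("same person" if distance < threshold else "different people")
```

Because the embeddings are L2-normalized, distances are bounded and directly comparable, which is what lets a single threshold separate same-identity from different-identity pairs.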