NIST examines the effect of demographic differences on face recognition

31. December 2019

As part of its Face Recognition Vendor Test (FRVT) program, the U.S. National Institute of Standards and Technology (NIST) conducted a study that evaluated face recognition algorithms submitted by industry and academic developers for their ability to perform various tasks. The study evaluated 189 software algorithms submitted by 99 developers. It focuses on how well each algorithm performs one of two different tasks that are among the most common applications of face recognition.

The two tasks are “one-to-one” matching, i.e. confirming that a photo matches another photo of the same person in a database. This is used, for example, when unlocking a smartphone or checking a passport. The second task involved “one-to-many” matching, i.e. determining whether the person in the photo matches any database. This is used to identify a person of interest.

A special focus of this study was that it also looked at the performance of the individual algorithms taking demographic factors into account. For one-to-one matching, only a few previous studies examined demographic effects; for one-to-many matching, there were none.

To evaluate the algorithms, the NIST team used four photo collections containing 18.27 million images of 8.49 million people. All were taken from operational databases of the State Department, Department of Homeland Security and the FBI. The team did not use images taken directly from Internet sources such as social media or from video surveillance. The photos in the databases contained metadata information that indicated the age, gender, and either race or country of birth of the person.

The study found that the result depends ultimately on the algorithm at the heart of the system, the application that uses it, and the data it is fed with. But the majority of face recognition algorithms exhibit demographic differences. In one-to-one matching, the algorithm rated photos of two different people more often as one person if they were Asian or African-American than if they were white. In algorithms developed by Americans, the same error occurred when the person was a Native American. In contrast, algorithms developed in Asia did not show such a significant difference in one-to-one matching results between Asian and Caucasian faces. However, these results show that algorithms can be trained to achieve correct face recognition results by using a wide range of data.