Deep neural networks can perform wonderful feats thanks to their extremely large and complicated web of parameters. But their complexity is also their curse: The inner workings of neural networks are often a mystery—even to their creators. This is a challenge that has been troubling the artificial intelligence community since deep learning started to become popular in the early 2010s.
FREMONT, CA: In tandem with the expansion of deep learning in various domains and applications, there has been a growing interest in developing techniques that try to explain neural networks by examining their results and learned parameters. But these explanations are often erroneous and misleading, and they provide little guidance in fixing possible misconceptions embedded in deep learning models during training.
In a paper published in the peer-reviewed journal Nature Machine Intelligence, scientists at Duke University propose “concept whitening,” a technique that can help steer neural networks toward learning specific concepts without sacrificing performance. Concept whitening bakes interpretability into deep learning models instead of searching for answers in millions of trained parameters. The technique, which can be applied to convolutional neural networks, shows promising results and can have great implications for how we perceive future research in artificial intelligence.
Given enough quality training examples, a deep learning model with the right architecture should be able to discriminate between different types of input. For instance, in the case of computer vision tasks, a trained neural network will be able to transform the pixel values of an image into its corresponding class. (Since concept whitening is meant for image recognition, we’ll stick to this subset of machine learning tasks. But many of the topics discussed here apply to deep learning in general.)
During training, each layer of a deep learning model encodes the features of the training images into a set of numerical values and stores them in its parameters. This is called the latent space of the AI model. In general, the lower layers of a multilayered convolutional neural network will learn basic features such as corners and edges. The higher layers of the neural network will learn to detect more complex features such as faces, objects, full scenes, etc.
Ideally, a neural network’s latent space would represent concepts that are relevant to the classes of images it is meant to detect. But we don’t know that for sure, and deep learning models are prone to learning the most discriminative features, even if they’re the wrong ones.
For instance, consider a data set in which the images of cats happen to have a logo in the lower right corner. A human would easily dismiss the logo as irrelevant to the task. But a deep learning model might find it to be the easiest and most efficient way to tell the difference between cats and other animals. Likewise, if all the images of sheep in your training set contain large swaths of green pastures, your neural network might learn to detect green farmlands instead of sheep.
So, aside from how well a deep learning model performs on training and test data sets, it is important to know which concepts and features it has learned to detect. This is where classic explanation techniques come into play.
Many deep learning explanation techniques are post hoc, which means they try to make sense of a trained neural network by examining its output and its parameter values. For instance, one popular technique to determine what a neural network sees in an image is to mask different parts of an input image and observe how these changes affect the output of the deep learning model. This technique helps create heat maps that highlight the features of the image that are most relevant to the neural network.
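The masking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular paper or library: `occlusion_heatmap` and `toy_model` are hypothetical names, the "model" is a stand-in function that scores an image purely by its lower right corner (mimicking a logo shortcut), and the image is a single-channel NumPy array.

```python
import numpy as np

def occlusion_heatmap(model, image, patch=4, baseline=0.0):
    """Return the drop in model score when each patch of the image is masked."""
    h, w = image.shape
    base_score = model(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = baseline  # mask one square
            # A large drop means the model relied on this region.
            heat[i // patch, j // patch] = base_score - model(occluded)
    return heat

# Hypothetical stand-in for a trained classifier (assumption for illustration):
# its score depends only on the lower right 4x4 corner of the image.
def toy_model(image):
    return float(image[-4:, -4:].mean())

image = np.ones((16, 16))
heat = occlusion_heatmap(toy_model, image)
hottest = tuple(int(k) for k in np.unravel_index(heat.argmax(), heat.shape))
print(heat)
print("most relevant patch:", hottest)
```

In this toy setup the only nonzero cell of the heat map is the lower right patch, which is how such a map can flag a shortcut like a corner logo. With a real network, `model` would be replaced by a function that returns the score of the class of interest.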
Other post hoc techniques involve turning different artificial neurons on and off and examining how these changes affect the output of the AI model. These methods can help find hints about relations between features and the latent space.
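A minimal sketch of this kind of neuron ablation, assuming a tiny two-layer network with hand-picked weights (the functions `forward` and `ablation_effects`, and all the numbers, are hypothetical and chosen only for illustration):

```python
import numpy as np

def forward(x, W1, W2, off=None):
    """Two-layer network with ReLU; optionally switch off one hidden neuron."""
    hidden = np.maximum(0.0, x @ W1)
    if off is not None:
        hidden = hidden.copy()
        hidden[off] = 0.0  # "turn off" the chosen neuron
    return float(hidden @ W2)

def ablation_effects(x, W1, W2):
    """Change in output caused by switching off each hidden neuron in turn."""
    base = forward(x, W1, W2)
    return np.array([base - forward(x, W1, W2, off=k)
                     for k in range(W1.shape[1])])

# Hand-picked toy weights (assumption): 2 inputs, 3 hidden neurons, 1 output.
x = np.array([1.0, 2.0])
W1 = np.array([[1.0, 0.0, -1.0],
               [0.0, 1.0, -1.0]])
W2 = np.array([0.5, 2.0, 1.0])

effects = ablation_effects(x, W1, W2)
print(effects)
```

Here the second hidden neuron accounts for most of the output and the third is inactive for this input, so switching it off changes nothing. In a real network the same loop would run over the units of a chosen layer, and the effects are usually far less clean.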
While these methods are helpful, they still treat deep learning models like black boxes and don’t paint a definite picture of the workings of neural networks.
“Explanation methods are often summary statistics of performance (e.g., local approximations, general trends on node activation) rather than actual explanations of the model’s calculations,” the authors of the concept whitening paper write.
For instance, the problem with saliency maps is that they often fail to reveal the wrong things that the neural network might have learned. And interpreting the role of single neurons becomes very difficult when the features of a neural network are scattered across the latent space.