Design of Image Recognition

What is image recognition?

Image recognition is the ability of a computer-powered camera to identify and detect objects or features in a digital image or video. It is a method for capturing, processing, examining, and interpreting images. To identify and detect objects in images, computers use machine vision technology that is powered by an artificial intelligence system.

Modes and types of image recognition

Image recognition models operate in one of three modes: single class, binary class, or multiclass.

In single-class image recognition, a model predicts only one label per image. For example, if we are training a cat-or-dog recognition model, a picture containing both a dog and a cat will still be assigned a single label.

When only two classes are involved (dog; no dog), we refer to those models as binary-class models.

Multiclass recognition models can assign several labels to a single image. For example, if we are training a cat, dog, and car model, the multiclass model typically outputs a confidence score for each possible class, describing the probability that the image belongs to that class.

The basic structure for image recognition

Nearly all image recognition models begin with an encoder. Encoders are made up of blocks of layers that learn statistical patterns in the pixels of images that correspond to the labels they’re attempting to predict. High-performing encoder designs featuring many narrowing blocks stacked on top of each other provide the “deep” in “deep neural networks”.

The encoder is then typically connected to a fully connected or dense layer that outputs confidence scores for each possible label. It’s important to note here that, for each input image, image recognition models output a confidence score for every label. In the case of single-class image recognition, we get a single prediction by choosing the label with the highest confidence score. In the case of multiclass recognition, a label is assigned only if its confidence score is over a particular threshold.
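The two decision rules described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the raw scores are made up, and the choice of softmax for the single-class case and an independent sigmoid with a 0.5 threshold for the multiclass case is a common convention assumed here rather than something the text prescribes.

```python
import math

def softmax(scores):
    """Turn raw scores into confidences that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Squash one raw score into an independent confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_single(labels, scores):
    """Single-class mode: keep only the label with the highest confidence."""
    confidences = softmax(scores)
    best = max(range(len(labels)), key=lambda i: confidences[i])
    return labels[best], confidences[best]

def predict_multi(labels, scores, threshold=0.5):
    """Multiclass mode: keep every label whose confidence clears the threshold."""
    scored = [(label, sigmoid(s)) for label, s in zip(labels, scores)]
    return [(label, conf) for label, conf in scored if conf > threshold]

labels = ["cat", "dog", "car"]
raw_scores = [2.0, 1.5, -3.0]  # hypothetical output of the final dense layer

print(predict_single(labels, raw_scores))  # one label: "cat"
print(predict_multi(labels, raw_scores))   # two labels: "cat" and "dog"
```

The same raw scores give one label in the first mode and two in the second, which is the practical difference between picking the maximum and applying a threshold.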

Finally, a note about accuracy. Most image recognition models are benchmarked using common accuracy metrics on common datasets. Top-1 accuracy refers to the fraction of images for which the model output class with the highest confidence score is equal to the true label of the image. Top-5 accuracy refers to the fraction of images for which the true label falls in the set of model outputs with the top 5 highest confidence scores.
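As a small worked example, top-k accuracy can be computed directly from the definition above. The scores and labels below are invented for illustration, and the function name `top_k_accuracy` is our own.

```python
def top_k_accuracy(all_scores, true_labels, k):
    """Fraction of images whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, truth in zip(all_scores, true_labels):
        # Class indices ordered from highest to lowest confidence.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical confidence scores for 4 images over 6 classes.
scores = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],  # true class 0: ranked first
    [0.30, 0.40, 0.10, 0.10, 0.05, 0.05],  # true class 0: ranked second
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],  # true class 5: ranked first
    [0.40, 0.25, 0.15, 0.10, 0.08, 0.02],  # true class 5: ranked last
]
truths = [0, 0, 5, 5]

print(top_k_accuracy(scores, truths, k=1))  # 0.5
print(top_k_accuracy(scores, truths, k=5))  # 0.75
```

Top-5 accuracy is never lower than top-1 accuracy, because every top-1 hit is also a top-5 hit.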

How does image recognition work?

Image recognition is one of the tasks in which deep neural networks (DNNs) excel. Neural networks are computing systems designed to recognize patterns. Their architecture is inspired by the human brain structure, hence the name. They consist of three types of layers: input, hidden layers, and output. The input layer receives a signal, the hidden layer processes it, and the output layer makes a decision or a forecast about the input data. Each network layer consists of interconnected nodes (artificial neurons) that do the computation.

The number of hidden layers varies: while traditional neural networks have up to three hidden layers, deep networks may contain hundreds of them.

The architecture of a neural network: each layer consists of nodes, and the number of hidden layers can vary.

Block diagram for image recognition
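The input, hidden, and output layers described in this section can be sketched as a tiny forward pass. Everything here is an illustrative assumption: the layer sizes, the random untrained weights, and the choice of ReLU for hidden nodes and a sigmoid for the output node.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: every node takes a weighted sum of all inputs, then activates."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def make_layer(n_in, n_out, rng):
    """Random (untrained) weights for a layer with n_out nodes."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(signal, layers):
    """Pass the input signal through the hidden layers and then the output layer."""
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        signal = dense(signal, weights, biases, sigmoid if is_output else relu)
    return signal

rng = random.Random(0)
sizes = [4, 3, 3, 1]  # 4 inputs, two hidden layers of 3 nodes each, 1 output node
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

output = forward([0.1, 0.8, 0.3, 0.5], layers)
print(output)  # a single confidence value between 0 and 1
```

Adding more entries to `sizes` adds more hidden layers, which is all that "deeper" means structurally.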

How neural networks learn to recognize patterns

How do we understand whether a person passing by on the street is an acquaintance or a stranger (complications like short-sightedness aren’t included)? We look at them, subconsciously analyze their appearance, and if some inherent features — face shape, eye color, hairstyle, body type, gait, or even fashion choices — match with a specific person we know, we recognize this individual. This brainwork takes just a moment.

So, to be able to recognize faces, a system must learn their features first. It must be trained to predict whether an object is X or Z. Deep learning models learn these characteristics differently from traditional machine learning (ML) models. That’s why model training approaches are different as well.

Training deep learning models (such as neural networks)

To build an ML model, data scientists must specify what input features (problem properties) the model will consider in predicting a result. These may include a customer’s education, income, and lifecycle stage, the product features or modules used, the number of interactions with customer support, and their outcomes. The process of constructing features using domain knowledge is called feature engineering.

Now suppose we were to train a model to tell a dog from a cat using feature engineering. Imagine gathering the characteristics of the billions of cats and dogs that live on this planet. We can’t construct accurate features that will work for each possible image while considering such complications as viewpoint-dependent object variability, background clutter, lighting conditions, or image deformation. There should be another approach, and it exists thanks to the nature of neural networks.

Neural networks learn features directly from data with which they are trained, so specialists don’t need to extract features manually.

The training data, in this case, is a large dataset that contains many examples of each image class. When we say a large dataset, we really mean it. For instance, the ImageNet dataset contains more than 14 million human-annotated images representing 21,841 concepts (synonym sets or synsets according to the WordNet hierarchy), with 1,000 images per concept on average.

The illustration of how a neural network recognizes a dog in an image.

Each image is labeled with a category it belongs to — a cat or dog. The algorithm explores these examples, learns about the visual characteristics of each category, and eventually learns how to recognize each image class. This model training style is called supervised learning.
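Supervised learning can be illustrated with a deliberately tiny sketch: a single artificial neuron that learns from labeled raw pixels, with no hand-built features. The 2x2 "images", their labels, the learning rate, and the number of epochs are all hypothetical, and a real image model would use many layers rather than one neuron.

```python
import math

def predict(weights, bias, pixels):
    """Confidence that the image belongs to class 1."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, learning_rate=0.5):
    """Adjust weights so predictions move toward the given labels."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for pixels, label in examples:
            error = predict(weights, bias, pixels) - label  # how wrong we were
            weights = [w - learning_rate * error * p for w, p in zip(weights, pixels)]
            bias -= learning_rate * error
    return weights, bias

# Flattened 2x2 images: [top-left, top-right, bottom-left, bottom-right].
# Label 0 = bright top row, label 1 = bright bottom row.
examples = [
    ([1.0, 1.0, 0.0, 0.0], 0),
    ([0.9, 0.8, 0.1, 0.0], 0),
    ([0.8, 1.0, 0.2, 0.1], 0),
    ([0.0, 0.0, 1.0, 1.0], 1),
    ([0.1, 0.0, 0.9, 0.8], 1),
    ([0.2, 0.1, 0.8, 1.0], 1),
]

weights, bias = train(examples)

# Images the model has not seen during training.
print(predict(weights, bias, [0.0, 0.2, 1.0, 0.9]))  # close to 1 (bright bottom)
print(predict(weights, bias, [1.0, 0.9, 0.0, 0.2]))  # close to 0 (bright top)
```

The labels are the only guidance the model receives; which pixels matter is something it works out from the examples.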

An example of a feature hierarchy learned by a deep learning model on faces

Each layer of nodes trains on the output (feature set) produced by the previous layer. So, nodes in each successive layer can recognize more complex, detailed features — visual representations of what the image depicts. Such a “hierarchy of increasing complexity and abstraction” is known as feature hierarchy.

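So, in general, the more layers the network has, the more complex the features it can represent and the greater its potential predictive capability, provided it can still be trained reliably.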

The leading architecture used for image recognition and detection tasks is the convolutional neural network (CNN). Convolutional neural networks consist of several layers with small neuron collections, each of them perceiving a small part of an image. The results from all the collections in a layer partially overlap to create a representation of the entire image. The next layer then repeats this process on the new image representation, allowing the system to learn about the image composition.
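The idea of small, overlapping views of an image can be made concrete with a plain sliding-window convolution. This is a minimal sketch with no padding and a stride of 1; the image and the hand-written vertical-edge kernel are illustrative, whereas a trained CNN learns its kernel values from data. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and record one response per position."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Each output value sees only a small patch of the image.
            total = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)  # [0.0, 2.0, 0.0] on each of the three rows
```

Neighbouring patches share pixels, which is the partial overlap mentioned above, and the resulting feature map is what the next layer receives as its input.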

Model architecture overview

Many neural network architectures exist for image recognition. Given the simplicity of the task, it’s common for new neural network architectures to be tested on image recognition problems and then applied to other areas, like object detection or image segmentation.


AlexNet was a deep neural network that won the ImageNet classification challenge in 2012 by a huge margin. Though it wasn’t the first convolutional neural network to be used for image recognition, it’s widely credited with sparking a resurgence of interest in using deep convolutional neural networks to solve computer vision problems. The network, however, is relatively large, with over 60 million parameters and many internal connections, thanks to dense layers that make the network quite slow to run in practice.
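A rough count shows why dense layers dominate the parameter budget. The layer sizes below (9216 to 4096 to 4096 to 1000) are the commonly cited dimensions of AlexNet's three fully connected layers and are used here as an assumption, not taken from the text above.

```python
def dense_params(n_inputs, n_outputs):
    """Every output node has one weight per input plus one bias."""
    return n_inputs * n_outputs + n_outputs

# Assumed sizes of the three fully connected layers (input size, output size).
dense_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]

total = sum(dense_params(n_in, n_out) for n_in, n_out in dense_layers)
print(total)  # 58631144
```

Under these assumptions the dense layers alone account for roughly 58.6 million parameters, which is most of the network.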


Two years after AlexNet, researchers from the Visual Geometry Group (VGG) at Oxford University developed a new neural network architecture dubbed VGGNet. VGGNet has more convolution blocks than AlexNet, making it “deeper”, and it comes in 16- and 19-layer varieties, referred to as VGG16 and VGG19, respectively.

The deeper network structure improved accuracy but also doubled its size and increased runtimes compared to AlexNet. Despite the size, VGG architectures remain a popular choice for server-side computer vision models due to their usefulness in transfer learning. VGG architectures have also been found to learn hierarchical elements of images like texture and content, making them popular choices for training style transfer models.


The Inception architecture, also referred to as GoogLeNet, was developed to solve some of the performance problems with VGG networks. Though accurate, VGG networks are very large and require huge amounts of computing and memory due to their many densely connected layers.

The Inception architecture solves this problem by introducing a block of layers that approximates these dense connections with sparser, computationally efficient calculations. Inception networks were able to achieve comparable accuracy to VGG using only one-tenth of the number of parameters.


The success of AlexNet and VGGNet opened the floodgates of deep learning research. As architectures got larger and networks got deeper, however, problems started to arise during training. When networks got too deep, training could become unstable and break down completely.

ResNets, short for residual networks, solved this problem with a clever bit of architecture. Blocks of layers are split into two paths, with one undergoing more operations than the other, before both are merged back together. In this way, some paths through the network are deep while others are not, making the training process much more stable overall. The most common variant of ResNet is ResNet50, containing 50 layers, but larger variants can have over 100 layers. The residual blocks have also made their way into many other architectures that don’t explicitly bear the ResNet name.
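The two-path structure can be sketched as follows. This is a simplified illustration that uses small fully connected layers in place of the convolutions a real ResNet uses; the function names and weights are hypothetical.

```python
def linear(vector, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, v) for v in vector]

def residual_block(x, weights_a, weights_b):
    """Merge a longer transformed path with an untouched shortcut path."""
    transformed = linear(relu(linear(x, weights_a)), weights_b)  # longer path
    return relu([t + s for t, s in zip(transformed, x)])         # add the shortcut

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With weights that contribute nothing, the block simply passes its input through.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]

# The same holds when many such blocks are stacked.
signal = x
for _ in range(100):
    signal = residual_block(signal, zeros, zeros)
print(signal)  # [1.0, 2.0, 3.0]
```

Because the shortcut carries the input forward unchanged, each block only has to learn a correction on top of it, which is one common explanation for why very deep stacks remain trainable.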


Even the smallest network architecture discussed thus far still has millions of parameters and occupies dozens or hundreds of megabytes of space. SqueezeNet was designed to prioritize speed and size while, quite astoundingly, giving up little ground in accuracy.

Despite being 50 to 500 times smaller than AlexNet (depending on the level of compression), SqueezeNet achieves accuracy similar to AlexNet’s. This feat is possible thanks to a combination of residual-like layer blocks and careful attention to the size and shape of convolutions. SqueezeNet is a great choice for anyone training a model with limited compute resources or for deployment on embedded or edge devices.


The MobileNet architectures were developed by Google with the explicit purpose of identifying neural networks suitable for mobile devices such as smartphones or tablets. They’re typically larger than SqueezeNet but achieve higher accuracy.

MobileNet architectures brought two important innovations to network designs: depthwise separable convolutions and a hyperparameter known as a width multiplier. Depthwise separable convolutions are a replacement for traditional convolution layers, having fewer parameters and being more computationally efficient. The width multiplier is a hyperparameter that scales the number of channels, and therefore the number of parameters, in each convolution layer. This allows for the creation of multiple networks along a tradeoff curve of size and speed versus accuracy. A continuum of models can be created with the same basic architecture so that more powerful devices can receive larger, more accurate models, while less powerful devices can use smaller, less accurate models.
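The savings from both ideas can be estimated with simple arithmetic. The sketch below counts weights only (biases are ignored), and the 3x3 kernel and 128-to-256 channel layer are arbitrary example values rather than a specific MobileNet layer.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """A standard convolution: one k x k filter per (input, output) channel pair."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_params(kernel, in_channels, out_channels):
    """A depthwise separable convolution: a depthwise step, then a pointwise step."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

def scaled(channels, width_multiplier):
    """Apply the width multiplier to a channel count."""
    return max(1, int(channels * width_multiplier))

print(standard_conv_params(3, 128, 256))   # 294912
print(separable_conv_params(3, 128, 256))  # 33920

# The width multiplier thins every layer, trading accuracy for size and speed.
for alpha in (1.0, 0.75, 0.5, 0.25):
    c_in, c_out = scaled(128, alpha), scaled(256, alpha)
    print(alpha, separable_conv_params(3, c_in, c_out))
```

For this example layer the separable version needs roughly one-ninth of the weights, and halving the width multiplier cuts the count by roughly a further factor of four.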