top of page

DNN-Based Object Detectors

Updated: May 11, 2022

Introduction, Approaches, Comparisons and Trade-offs

Unlike image classifiers, which simply report on the most important objects within an image, object detectors determine where objects of interest are located, their sizes and class labels within an image. Consequently, object detectors are central to numerous computer vision applications.

In this blog, we provide a technical introduction to deep-neural-network-based object detectors. We explain how these algorithms work, and how they have evolved in recent years, utilizing examples of popular object detectors. We will discuss some of the trade-offs to consider when selecting an object detector for an application, and touch on accuracy measurement. We also discuss performance comparison among the models discussed in this blog.

Early Approaches

The seminal work in this regard was reported in the form of technical report at UC Berkeley, in 2014 by Ross Girshick et. al. entitled "Rich feature hierarchies for accurate object detection and semantic segmentation". This is popularly known as “R-CNN” (Regions with CNN features). They approached this problem in a methodical way with having 3 distinct algorithmic stages which is shown in Figure 1 below.

Figure 1: R-CNN [“Rich feature hierarchies for accurate object detection and semantic segmentation”]

As seen above, there are three stages:

  • Region Proposal: Generate and extract category independent region proposals, using selective search.

  • Feature Extractor: Extract feature from each candidate region using a deep CNN

  • Classifier. Classify features as one of the known classes using linear SVM classifier model.

Here the main issue was that it produces lots of overlapping bounding boxes and post processing and filtering was needed to produce reliable results. Consequently, it was very slow and require 3 different algorithms to work in tandem. A faster version of this approach was published in 2015 entitled “Fast R-CNN” where authors combined second and third stages into one network with two outputs - one for classification and the other for bounding-box regression.

Shortly afterwards, they published “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” where region proposal scheme was also implemented using a separate neural network.

Recent Approaches


In 2016 a mark departure was proposed by Joseph Redmon entitled "You Only Look Once: Unified, Real-Time Object Detection" or in short, YOLO. It divides the input image into an S × S grid. If the centre of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. The Figure 2 below shows the basic idea.

Figure 2: YOLO [You Only Look Once: Unified, Real-Time Object Detection]