Updated: Jul 6
In many computer vision applications, detection algorithms are the ‘bread and butter’, whether for detecting faces, people, vehicles, or any other objects. If you’re developing a computer vision application, you’re probably familiar with some of the modern detection algorithms such as Single Shot Detector (SSD) or YOLO (You Only Look Once). While these algorithms work quite well for many applications, newer state-of-the-art detection models can offer gains in performance, accuracy, and functionality. As an embedded developer, your primary task when using a detection algorithm should be collecting and labeling the training data for the model (although you might need a good data scientist to assist with this). But you’ll also be tasked with selecting the most appropriate detection algorithm for your application, and it seems like new ones are popping up regularly. This blog will give you some insight into the pros and cons of three popular algorithms: SSD, YOLO, and CenterNet.
Detection Begins with a Strong Backbone
Classification models such as MobileNet and ResNet generally contain two parts: a feature extractor (otherwise known as the backbone) and a classification head (the output layers). A complete classification model needs the classification head to produce its inference results. However, when the backbone is combined with a detection model, the classification head is not used, so the output of the backbone feeds directly into the head of the detection model.
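To make this split concrete, here is a minimal sketch in plain Python that only tracks tensor shapes rather than running a real network. The `ConvStage`, `ClassifierHead`, and `DetectionHead` names, the stage sizes, and the six outputs per location are all made up for illustration; they do not correspond to MobileNet, ResNet, or any particular detector.

```python
from dataclasses import dataclass
from typing import List, Tuple

Shape = Tuple[int, ...]  # feature maps are (channels, height, width)


@dataclass(frozen=True)
class ConvStage:
    """Hypothetical backbone stage: sets the channel count, downsamples by `stride`."""
    out_channels: int
    stride: int

    def output_shape(self, shape: Shape) -> Shape:
        _, height, width = shape
        return (self.out_channels, height // self.stride, width // self.stride)


@dataclass(frozen=True)
class ClassifierHead:
    """Pools away the spatial grid and emits one score per class."""
    num_classes: int

    def output_shape(self, shape: Shape) -> Shape:
        return (self.num_classes,)


@dataclass(frozen=True)
class DetectionHead:
    """Keeps the spatial grid and emits a few values per location."""
    outputs_per_location: int

    def output_shape(self, shape: Shape) -> Shape:
        _, height, width = shape
        return (self.outputs_per_location, height, width)


def run(layers: List, input_shape: Shape) -> Shape:
    """Pass a shape through each layer in order and return the final shape."""
    shape = input_shape
    for layer in layers:
        shape = layer.output_shape(shape)
    return shape


# Assumed backbone: four stride-2 stages, so an overall stride of 16.
backbone = [ConvStage(32, 2), ConvStage(64, 2), ConvStage(128, 2), ConvStage(256, 2)]

# Same backbone, two different heads.
classifier = backbone + [ClassifierHead(num_classes=1000)]
detector = backbone + [DetectionHead(outputs_per_location=6)]

image = (3, 256, 256)
feature_shape = run(backbone, image)
class_shape = run(classifier, image)
det_shape = run(detector, image)

print("backbone features:", feature_shape)
print("classifier output:", class_shape)
print("detector output:  ", det_shape)
```

With a 256×256 input, this assumed backbone produces a 256-channel 16×16 feature map. The classifier collapses that map into 1000 scores, while the detector keeps the 16×16 grid so that each cell can make its own prediction.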
There are many choices when it comes to selecting a backbone for a detection model, although some detection models are designed with specific backbones. For example, YOLO is typically connected to a Darknet backbone, and SSD is typically associated with MobileNet. CenterNet has more flexibility: its backbone (ResNet, DLA, Hourglass, or others) is selected based on performance and accuracy requirements.
Typically, you can find free backbones and detection models in public model zoos. These come pre-trained with datasets, so you won’t have to train them from scratch. Let Google and others do this heavy lifting on their