Brad Scott
- Apr 15, 2021

AI on an idle GPU

Updated: Jul 6, 2021

With the breadth of heterogeneous cores available on today’s SoCs, embedded developers routinely face a dilemma: which core should run the machine learning inference engine on an edge device? After all, high-end parts can offer an array of compute blocks, from single and multicore Arm Cortex-A72s and Cortex-A53s to GPUs, Arm Cortex-M cores and others, depending on the vendor and the specific SoC.

source: https://www.nxp.com

The question often arises: which core is best for edge inferencing on an IoT device? While GPUs are well known as the de facto standard for training neural networks (NNs), there is no equivalent de facto standard for the edge inferencing core. As developers grapple with which core on the SoC to use, it may help to take a step back and ask, “Why are GPUs used for NN training?” After all, CPUs are very good at executing complex algorithms, so why aren’t they ideal for training NN models?

Memory bandwidth is one key reason (there are others, of course, such as parallel processing), so let’s start there. Training a NN usually requires a tremendous amount of data. Although a CPU is ideal for complex algorithms, its memory bandwidth (its ability to pull in lots of data) may be somewhat limited relative to a server-class GPU. One way to think of it: the CPU has been optimized for fast memory access (low latency), but it can only read relatively small amounts of memory at a time. The GPU, by contrast, is not optimized for low-latency access: each GPU memory access carries a larger overhead. However, the GPU’s memory bandwidth is, relatively speaking, very large.

One can think of the GPU as a truck capable of hauling large amounts of data at one time (the overhead is the time needed to load the truck), while the CPU is a fast car that moves very quickly but can only carry small amounts of data. It may not be the best analogy, but hopefully it illustrates why memory bandwidth matters so much for GPUs.
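To put rough numbers on the car-versus-truck picture, here is a minimal Python sketch. It assumes a simple model in which moving data costs a fixed start-up latency plus the data size divided by bandwidth. The `CAR` and `TRUCK` figures are hypothetical, chosen only to show the shape of the trade-off, and are not measurements of any real CPU or GPU.

```python
# Toy model of the car-versus-truck analogy. All numbers are hypothetical.

def transfer_time_us(size_bytes, latency_us, bandwidth_gb_s):
    """Fixed start-up latency plus the time to stream size_bytes."""
    bytes_per_us = bandwidth_gb_s * 1000.0  # 1 GB/s = 1000 bytes per microsecond
    return latency_us + size_bytes / bytes_per_us


def break_even_bytes(car, truck):
    """Transfer size above which the truck (more overhead, more bandwidth) wins."""
    extra_latency_us = truck["latency_us"] - car["latency_us"]
    per_byte_saving_us = (1.0 / (car["bandwidth_gb_s"] * 1000.0)
                          - 1.0 / (truck["bandwidth_gb_s"] * 1000.0))
    return extra_latency_us / per_byte_saving_us


# Hypothetical figures, for illustration only.
CAR = {"latency_us": 0.1, "bandwidth_gb_s": 20.0}      # CPU-like: low latency, modest bandwidth
TRUCK = {"latency_us": 10.0, "bandwidth_gb_s": 200.0}  # GPU-like: high overhead, high bandwidth

for size in (4 * 1024, 64 * 1024 * 1024):
    car_us = transfer_time_us(size, **CAR)
    truck_us = transfer_time_us(size, **TRUCK)
    print(f"{size:>10} bytes: car {car_us:10.2f} us, truck {truck_us:10.2f} us")

print(f"break-even at about {break_even_bytes(CAR, TRUCK):.0f} bytes")
```

With these made-up figures the car wins for a 4 KB transfer and the truck wins for a 64 MB transfer, with the crossover at roughly 220 KB. Real hardware will land somewhere else entirely, but the shape of the curve is the point.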

From an execution perspective, real-time machine learning inference on an edge device is computationally intensive and can benefit from an embedded GPU because the underlying tensor math operations are parallel in nature. Compared with a CPU, the GPU has many more cores available for execution. A CPU, with fewer cores and largely sequential operation, therefore needs more clock cycles to work through these operations, while a GPU can spread the same work across its many processing cores and finish in far fewer clock cycles.
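To see where that parallelism comes from, consider a single dense layer. The Python sketch below is illustrative only: `dense_layer` shows that each output value is an independent dot product, and `idealized_steps` is a hypothetical, heavily simplified count of sequential multiply-accumulate steps when `lanes` outputs are computed at once. It ignores memory traffic, scheduling and everything else that makes real hardware interesting.

```python
import math


def dense_layer(weights, x):
    """y[i] = sum_j weights[i][j] * x[j]; each y[i] is independent of the others."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def idealized_steps(rows, cols, lanes):
    """Sequential multiply-accumulate steps if `lanes` output rows run in parallel."""
    return math.ceil(rows / lanes) * cols


print(dense_layer([[1, 2], [3, 4]], [1, 1]))

# Hypothetical lane counts: a handful of CPU cores versus many GPU lanes.
for lanes in (4, 256):
    print(f"{lanes:>4} lanes: {idealized_steps(1024, 1024, lanes)} steps")
```

Because no output row depends on another, the work divides cleanly across however many execution lanes are available, which is the property a GPU exploits.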

Well, there you have it: with greater memory bandwidth and parallel processing, the GPU should be considered for edge inferencing.

If only it were that straightforward!

Other factors have to be considered. For instance, it is important to understand the GPU’s performance gain relative to its memory access overhead, as any gain obtained by using the GPU can be offset by the latency that overhead adds. Also, on embedded SoCs, the GPU’s memory bandwidth may be no better than the CPU’s; in fact, it may be worse.
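That trade-off can be framed as a simple break-even check. The Python sketch below assumes a deliberately crude model: offloading costs a fixed overhead, after which the GPU runs the workload `kernel_speedup` times faster than the CPU. The function names and all figures are hypothetical, for illustration only.

```python
def gpu_offload_time_ms(cpu_time_ms, kernel_speedup, overhead_ms):
    """Estimated GPU time: fixed offload overhead plus the accelerated compute."""
    return overhead_ms + cpu_time_ms / kernel_speedup


def offload_pays_off(cpu_time_ms, kernel_speedup, overhead_ms):
    """True when the estimated GPU time beats staying on the CPU."""
    return gpu_offload_time_ms(cpu_time_ms, kernel_speedup, overhead_ms) < cpu_time_ms


# Hypothetical cases: (label, CPU time in ms, GPU kernel speedup, overhead in ms).
cases = [
    ("small model", 2.0, 4.0, 3.0),            # overhead swamps the gain
    ("large model", 40.0, 4.0, 3.0),           # gain dominates the overhead
    ("bandwidth-starved GPU", 40.0, 0.8, 3.0), # GPU is slower even before overhead
]
for label, cpu_ms, speedup, overhead in cases:
    gpu_ms = gpu_offload_time_ms(cpu_ms, speedup, overhead)
    verdict = "offload" if offload_pays_off(cpu_ms, speedup, overhead) else "stay on CPU"
    print(f"{label}: CPU {cpu_ms:.1f} ms, GPU {gpu_ms:.1f} ms -> {verdict}")
```

The third case stands in for an embedded SoC whose GPU has less usable memory bandwidth than the CPU, so the effective speedup drops below 1 and offloading never pays off.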

It is also worth mentioning another consideration lurking in the background when using the GPU for edge inferencing: the not-so-obvious engineering effort required to optimize the inference engine for the GPU. This spans many aspects, such as structuring low-level primitives for the GPU, memory alignment, cache hit rates and data transfer rates, and of course how well the network’s kernels can be parallelized on the GPU. It routinely requires an understanding of not only the AI/ML use case but the entire system, as the balance of the system application is often running on the CPU core complex.

The net net is that the GPU can be an excellent offload engine for ML inferencing, but not always. And when the GPU is used, optimizing the inference engine can be very challenging.

This is where Au-Zone’s DeepView Toolkit can really help. DeepView was designed to abstract away these complexities for embedded developers by providing an environment for adding AI/ML intelligence to edge devices. DeepViewRT is a very portable, highly optimized inference engine that has been tuned for embedded GPUs. DeepViewRT is also unique in that it supports the spectrum of compute blocks on modern SoCs, including Cortex-A, Cortex-M and GPUs. So, with the DeepView Toolkit, it is very straightforward to switch between CPU and GPU inferencing to determine which is optimal for a given platform and use case. The built-in tools provide developers with the detailed performance insights required for inference engine optimization. With the DeepView Toolkit, developers can add AI intelligence to edge devices while meeting system requirements for performance, latency, power and memory consumption on the appropriate compute block for ML inferencing, including the GPU.
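Whatever tooling is used, the underlying comparison is to run the same model on each compute block and measure it. The generic Python sketch below shows that idea. It does not use the DeepView Toolkit or DeepViewRT API: `benchmark`, `pick_fastest` and the two `fake_*` stand-in workloads are hypothetical names invented for illustration, and in practice they would be replaced by real CPU and GPU inference calls from whichever runtime is in use.

```python
import statistics
import time


def benchmark(run_inference, warmup=3, runs=20):
    """Median latency in milliseconds of a zero-argument callable."""
    for _ in range(warmup):  # warm caches before timing
        run_inference()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def pick_fastest(backends, **kwargs):
    """backends maps a label to a callable; returns (best_label, latencies_ms)."""
    results = {name: benchmark(fn, **kwargs) for name, fn in backends.items()}
    best = min(results, key=results.get)
    return best, results


# Stand-in workloads (assumptions): replace with real CPU and GPU inference calls.
def fake_cpu_inference():
    sum(range(200_000))


def fake_gpu_inference():
    sum(range(20_000))


best, results = pick_fastest({"cpu": fake_cpu_inference, "gpu": fake_gpu_inference})
for name, latency_ms in results.items():
    print(f"{name}: {latency_ms:.3f} ms median")
print(f"fastest: {best}")
```

The median is used rather than the mean so that an occasional slow run, such as one interrupted by another task on the system, does not skew the comparison. Latency is only one of the requirements listed above; power and memory consumption would need their own measurements.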
