Rosella       Machine Intelligence & Data Mining

Computer Vision and GPU Computing

Computer vision based on machine learning is a very compute-intensive area. In particular, training models with millions of parameters takes huge computing resources. A GPU is essential for training large deep neural network models: a high-performance GPU can be hundreds or even thousands of times faster than a single CPU thread. There are several GPU programming platforms, which the subsequent sections discuss.

Two Golden Rules for GPU Programming

GPUs are great for the right applications. But GPU calls carry large base and synchronization latency, so a GPU will perform poorly for many applications. Here are two golden rules for GPU programming:

  1. In each GPU request, consecutive GPU cores should write to consecutive global memory locations. This ensures writes are performed efficiently, without conflicts. In addition, reading nearby global memory locations is essential.
  2. Create/enqueue a sequence of GPU requests that will be executed sequentially without CPU synchronization. CPU synchronization should be done only at the last GPU request of the sequence.
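The indexing pattern behind the first rule can be sketched off-GPU. The snippet below is a plain-Python emulation, not real kernel code: a hypothetical kernel body runs once per global ID, and with the coalesced pattern, consecutive IDs (i.e. consecutive GPU cores in a warp/wavefront) write to consecutive output slots.

```python
# Emulate a GPU dispatch: the "kernel" body runs once per global ID.
def kernel_coalesced(gid, src, dst):
    # Work-item gid writes dst[gid]: consecutive IDs touch
    # consecutive global memory locations, so the hardware can
    # combine the writes into few memory transactions.
    dst[gid] = 2.0 * src[gid]

def kernel_strided(gid, src, dst, stride=64):
    # Anti-pattern: consecutive IDs write locations 'stride' apart,
    # so a real GPU cannot coalesce the writes.
    dst[(gid * stride) % len(dst)] = 2.0 * src[gid]

n = 256
src = [float(i) for i in range(n)]
dst = [0.0] * n
for gid in range(n):          # a GPU would run these in parallel
    kernel_coalesced(gid, src, dst)

print(dst[10])  # -> 20.0
```

On a real device the same indexing is written inside the OpenCL or Cuda kernel using the work-item's global ID.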

The first rule minimizes global memory write/read latency. The second rule minimizes base and synchronization latency. If you can organize your program around these two golden rules, you can get good GPU performance. Otherwise performance will be poor!
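The second rule can be made concrete with a toy latency model (the numbers below are illustrative assumptions, not measurements): if every CPU-synchronized call pays a fixed base latency, then batching k kernels behind a single synchronization pays that cost once instead of k times.

```python
def total_time_ms(kernels, work_ms, base_ms, sync_every_call):
    """Toy model of GPU dispatch cost; all figures are assumed."""
    if sync_every_call:
        # CPU waits after every request: base latency paid k times.
        return kernels * (base_ms + work_ms)
    # Requests queued back to back, one CPU sync at the end.
    return base_ms + kernels * work_ms

# Assumed: 100 kernels, 0.2 ms of GPU work each, 1 ms base latency.
k, work, base = 100, 0.2, 1.0
naive = total_time_ms(k, work, base, sync_every_call=True)     # 120.0 ms
batched = total_time_ms(k, work, base, sync_every_call=False)  #  21.0 ms
print(naive, batched)
```

Under these assumed numbers the batched schedule is nearly six times faster, even though the GPU does exactly the same work in both cases.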


OpenCL

OpenCL is an industry-standard GPU programming platform. Supported systems include Windows, Linux, and macOS. ARM Android devices also have OpenCL drivers; however, a developer library is not provided. OpenCL can be a bit slow because the cost of setting kernel argument values is high. This is why OpenCL can be slower than Cuda, which is discussed in the next section.

For OpenCL GPU programming guides, please read OpenCL Programming Guides with Example.


Cuda

Cuda is Nvidia's proprietary GPU programming platform. In essence, OpenCL and Cuda are very similar, as both aim at high-performance computing. Cuda can be a bit faster than OpenCL and simpler to code. However, for applications that do many enqueues, OpenCL and Cuda have similar performance. Deep neural network training is one such example.

For Cuda GPU programming guides, please read Cuda Programming Guides with Example Code.

OpenGL ES3 (GLES3)

OpenGL ES is for graphics processing. From version 3, GLES started to support compute shaders, so it can be used as an alternative to OpenCL. However, GLES3 is not widely implemented on edge computing devices, so the jury is still out.

Orange Pi 5

Orange Pi 5 has a GPU with 64 shading cores. OpenCL is supported on Ubuntu and Debian, and performance is quite good: the OPi5 GPU can be 19 times faster than the 8 OPi5 CPU cores combined. A 25-million-parameter YOLO-like object detection model finishes in 0.68 seconds, whereas 8 CPU threads finish in 13 seconds. Quite a reasonable speed for computer vision.
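The quoted speedup follows directly from the two timings (13 s on 8 CPU threads versus 0.68 s on the GPU):

```python
# Timings quoted above for the 25M-parameter YOLO-like model.
cpu_s, gpu_s = 13.0, 0.68
speedup = cpu_s / gpu_s
print(round(speedup, 1))  # -> 19.1, consistent with the ~19x claim
```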

Raspberry Pi 4

There is no OpenCL driver for Raspberry Pi 4, nor an OpenGL ES3 driver, so the GPU cannot be utilized; we have to rely on multiple CPU cores. This is a major drawback of Raspberry Pi.


ARM Android Devices

Many ARM Android devices have an OpenCL driver. But there is no OpenCL library that developers can use to compile and link OpenCL-based programs, so it's unusable in practice.

An alternative GPU library is OpenGL ES3, and it does work. However, when heavy-duty jobs are enqueued, Android seems to skip jobs. So it works only for small computer vision models.

Apple's Metal

Metal is Apple's GPU programming platform. Like OpenGL, it is primarily for graphics processing, but compute shaders are also provided. The Objective-C version of Metal works, but the Swift version may not use the GPU at all; instead it runs GPU shader programs on CPU cores, which is of course much slower than GPU computing.