Arbitrary Precision and Low Complexity Micro-Architectural Arithmetic Optimisations of Machine Learning Algorithms for Compute Bound and High-Performance Systems
Citation: Garland, James Philip, Arbitrary Precision and Low Complexity Micro-Architectural Arithmetic Optimisations of Machine Learning Algorithms for Compute Bound and High-Performance Systems, Trinity College Dublin. School of Computer Science & Statistics, 2021
thesisPrintedJamesGarland.pdf (Accepted for publication (author's copy) - Peer Reviewed) 11.43Mb
Artificial intelligence (AI) is becoming ubiquitous and pervasive in our daily lives. Machine learning (ML), a subset of AI, powers more accurate internet searches, voice recognition in home appliances, tagging of people in photos, object detection in videos, and driver assistance systems in vehicles. Convolutional neural networks (CNNs), a subset of ML, process these images, videos and sometimes audio data. Captured and preprocessed by embedded internet of things (IoT) devices, CNN data are often processed in internet data centres or on local PCs with high-performance processors and acceleration cards, due to CNNs' enormous energy, bandwidth, and processing requirements. There is a need to move more of this CNN processing to IoT edge and embedded devices for low-power and potentially offline processing. The CNN convolution layer consists of millions of multiply-accumulates (MACs), the arithmetic of which can be in fixed-point, integer or floating-point format. The CNN can operate in training mode or inference mode. During inference, the convolution layer occupies up to 90% of the computation time and energy of the CNN, convolving the input feature map (IFM) with the kernel weight data. The storage and movement of weight data and the acceleration of the convolution computation are often beyond the energy, storage and compute bounds of embedded devices. We investigate opportunities for optimising the hardware energy efficiency, gate-level area, and execution time of the CNN convolution layer's MAC arithmetic, while maintaining inference classification accuracy of the CNN accelerator implementation. Our first contribution investigates reducing energy consumption and application-specific integrated circuit (ASIC) die area while maintaining classification accuracy of CNNs. We also investigate latency and resource efficiency when implemented in a field-programmable gate array (FPGA).
Our second contribution focuses on decreasing software execution time of low-precision floating-point (FP) CNNs by exploiting hardware optimisation of central processing unit (CPU) vector register packing and single instruction multiple data (SIMD) bitwise instructions used in the CNN MAC.
Author: Garland, James Philip
Publisher: Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science
Type of material: Thesis
Availability: Full text available