Arbitrary Precision and Low Complexity Micro-Architectural Arithmetic Optimisations of Machine Learning Algorithms for Compute Bound and High-Performance Systems

File Type:

PDF

Item Type:

Thesis

Date:

2021

Author:

Garland, James Philip

Access:

openAccess

Citation:

Garland, James Philip, Arbitrary Precision and Low Complexity Micro-Architectural Arithmetic Optimisations of Machine Learning Algorithms for Compute Bound and High-Performance Systems, Trinity College Dublin. School of Computer Science & Statistics, 2021

Download Item:

thesisPrintedJamesGarland.pdf (Accepted for publication (author's copy) - Peer Reviewed) 11.43Mb

Abstract:

Artificial intelligence (AI) is becoming ubiquitous and pervasive in our daily lives. Machine learning (ML), a subset of AI, supplies more accurate internet searches, voice recognition in home appliances, tagging of people in photos, object detection in videos, and driver assistance systems in vehicles. Convolutional neural networks (CNNs), a subset of ML, process these images, videos and sometimes audio data. Captured and preprocessed by embedded internet of things (IoT) devices, CNN data are often processed in internet data centres or on local PCs with high-performance processors and acceleration cards, because of CNNs' enormous energy, bandwidth, and processing requirements. There is a need to move more of this CNN processing to IoT edge and embedded devices for low-power and potentially offline processing. The CNN convolution layer consists of millions of multiply-accumulates (MACs), the arithmetic of which can be in fixed-point, integer or floating-point format. The CNN can operate in training mode or inference mode. During inference, the convolution layer occupies up to 90% of the computation time and energy of the CNN, convolving the input feature map (IFM) with the kernel weight data. The storage and movement of weight data, and the acceleration of the convolution computation, are often beyond the energy, storage and compute bounds of embedded devices. We investigate opportunities for optimising the hardware energy efficiency, gate-level area, and execution time of the CNN convolution layer's MAC arithmetic, while maintaining the inference classification accuracy of the CNN accelerator implementation. Our first contribution investigates reducing energy consumption and application-specific integrated circuit (ASIC) die area while maintaining the classification accuracy of CNNs. We also investigate latency and resource efficiency when implemented in a field-programmable gate array (FPGA). Our second contribution focuses on decreasing the software execution time of low-precision floating-point (FP) CNNs by exploiting hardware optimisation of central processing unit (CPU) vector register packing and single instruction multiple data (SIMD) bitwise instructions used in the CNN MAC.
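
As an illustration of why the convolution layer is dominated by MACs, the following minimal sketch convolves a single-channel IFM with one kernel and counts the multiply-accumulates performed. It is not taken from the thesis: the function name `conv2d_mac`, the single-channel layout, stride 1, no padding, and plain integer arithmetic are all assumptions made for this example.

```python
def conv2d_mac(ifm, kernel):
    """Direct 2D convolution of one channel (stride 1, no padding).

    Hypothetical helper for illustration only. As is usual in CNNs, the
    kernel is not flipped (this is cross-correlation). Returns the output
    feature map and the number of multiply-accumulates performed.
    """
    in_h, in_w = len(ifm), len(ifm[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    ofm = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for y in range(out_h):              # each output row
        for x in range(out_w):          # each output column
            acc = 0                     # accumulator for one output value
            for i in range(k_h):        # slide over the kernel window
                for j in range(k_w):
                    acc += ifm[y + i][x + j] * kernel[i][j]  # one MAC
                    macs += 1
            ofm[y][x] = acc
    return ofm, macs
```

For a 3x3 IFM and a 2x2 kernel this performs 2 * 2 * 2 * 2 = 16 MACs. The count grows as the product of output size, kernel size, and (in a real layer) input and output channel counts, which is how a single layer reaches millions of MACs.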

URI:

http://hdl.handle.net/2262/97651

Sponsor:

SFI stipend

Grant Number:

12/IA/1381

Author's Homepage:

https://tcdlocalportal.tcd.ie/pls/EnterApex/f?p=800:71:0::::P71_USERNAME:JGARLAND

Description:

APPROVED

Advisor:

Gregg, David

Publisher:

Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science

Type of material:

Thesis

Availability:

Full text available

Keywords:

CNN, power efficiency, multiply accumulate, arithmetic hardware circuits, ASIC, FPGA, bitslice parallel arithmetic, datapath circuits, hardware accelerators, reduced floating-point precision arithmetic, convolutional neural networks, approximate computing

Licences:

Original License