Improving parallelism through communication, partitioning and load balancing optimizations
Citation:
Servesh Muralidharan, 'Improving parallelism through communication, partitioning and load balancing optimizations', [thesis], Trinity College (Dublin, Ireland). School of Computer Science & Statistics, 2016, pp. 156

Abstract:
Multi-core processors have been central to the increase in performance of high-end computing over the last decade. This thesis focuses on several bottlenecks of parallel applications that utilize these parallel architectures. Streaming systems can process large sets of data with a small memory footprint, which makes the streaming model popular for programs such as image and signal processing, financial trading systems, database systems and network data analysis. A common problem is that many streaming systems require low-latency, high-bandwidth communication channels, so they are easily affected by the performance of the underlying communication system. This thesis proposes a solution to this problem in the form of a runtime framework that integrates the communication operations within stream applications and executes them in parallel on multi-core processors. In the first experiment we compare two applications, compression and encryption, and evaluate their performance while varying the number of senders and receivers. We show a maximum speedup of ~8x over a single sender/receiver for encryption and ~12x for compression. We also compare our framework with an equivalent MPI version of each application. For compression, which is communication bound, the proposed framework performs ~25x better; for encryption, which is compute bound, the MPI version performs ~30x better. The results show that our framework is better suited to streaming applications that are bottlenecked by intensive communication. In heterogeneous high performance computing (HPC) architectures with varying compute units, communication constraints and topologies, it is difficult to map the filters of a stream graph to the most appropriate cores.
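As an illustration of the multi-sender/multi-receiver idea described above, the following sketch (not code from the thesis; all names and the in-memory channel are hypothetical stand-ins) uses Python worker threads to compress and transfer stream chunks in parallel, mimicking a compression pipeline with configurable sender and receiver counts:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def parallel_send(chunks, n_senders, channel):
    """Compress and enqueue chunks using n_senders worker threads.
    Illustrative stand-in for a multi-sender communication stage."""
    def send(chunk):
        channel.put(zlib.compress(chunk))
    with ThreadPoolExecutor(max_workers=n_senders) as pool:
        list(pool.map(send, chunks))

def parallel_receive(n_chunks, n_receivers, channel):
    """Dequeue and decompress chunks using n_receivers worker threads.
    Chunk order is not preserved, as with independent parallel channels."""
    def recv(_):
        return zlib.decompress(channel.get())
    with ThreadPoolExecutor(max_workers=n_receivers) as pool:
        return list(pool.map(recv, range(n_chunks)))

if __name__ == "__main__":
    data = [b"stream payload %d" % i * 100 for i in range(8)]
    q = Queue()  # in-memory channel standing in for a network link
    parallel_send(data, n_senders=4, channel=q)
    out = parallel_receive(len(data), n_receivers=4, channel=q)
    assert sorted(out) == sorted(data)
```

Scaling `n_senders`/`n_receivers` is the knob varied in the experiments above; on a real network link the compression (CPU) and transfer (I/O) costs overlap across workers.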
This thesis proposes an algorithm, the Heterogeneous Multiconstraint Application Partitioner (HMAP), that can partition applications exhibiting task and data parallelism onto heterogeneous architectures, resulting in increased performance. Heterogeneous compute clusters consist of processing elements with different compute speeds, vector lengths and communication bandwidths, all of which must be considered when partitioning the application and its associated data. A staged partitioning approach is used to tackle this problem. Our experiments show that the proposed HMAP algorithm outperforms Metis by around 1.5x on average and by around 3x for large heterogeneous architectures. MPI provides a good programming model for exploiting parallelism between cluster nodes, but it is often possible to achieve greater speedups by exploiting shared memory and finer-grain parallelism within cluster nodes. MPI systems exist that use lightweight threads and coroutines instead of heavier processes to exploit such parallelism. We propose two main mechanisms that reduce the overheads of such systems within shared-memory multi-core cluster nodes. First, a novel runtime system uses fibers to oversubscribe cores, which can reduce the impact of communication delays and imbalances in some applications. Second, a set of message-passing mechanisms exploits fibers and shared memory to reduce communication costs. These techniques yield speedups ranging from 1.2x to 3.8x over OpenMPI in the 16-process version. We also propose an autotuner that determines a suitable set of parameters and, in our experiments, achieves a speedup of ~2x over the baseline in some cases.
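The partitioning problem HMAP addresses can be illustrated with a toy greedy scheduler (this is not the HMAP algorithm itself, which uses a staged multi-constraint graph partitioning; the function and its parameters here are hypothetical): each task is assigned to the core that would finish it earliest given per-core compute speeds, showing why heterogeneity must be modelled explicitly:

```python
def heterogeneous_partition(task_weights, core_speeds):
    """Greedy list scheduling over cores with different speeds.

    task_weights: work units per task; core_speeds: units/sec per core.
    Returns (assignment, makespan). Largest-task-first is a common
    heuristic; a speed-oblivious partitioner would balance raw load
    and overload the slow cores.
    """
    loads = [0.0] * len(core_speeds)        # work units placed on each core
    assignment = [None] * len(task_weights)
    for t in sorted(range(len(task_weights)),
                    key=lambda i: -task_weights[i]):
        w = task_weights[t]
        # pick the core with the earliest finish time if it takes task t
        best = min(range(len(core_speeds)),
                   key=lambda c: (loads[c] + w) / core_speeds[c])
        loads[best] += w
        assignment[t] = best
    makespan = max(l / s for l, s in zip(loads, core_speeds))
    return assignment, makespan

if __name__ == "__main__":
    # one core twice as fast as the other: the fast core takes more work
    assignment, makespan = heterogeneous_partition([4, 3, 2, 1], [2.0, 1.0])
    print(assignment, makespan)  # → [0, 1, 0, 0] 3.5
```

The real problem adds further constraints (vector lengths, link bandwidths, stream-graph edges), which is what motivates HMAP's staged multi-constraint formulation rather than a single greedy pass.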
Sponsor: Irish Research Council Enterprise Partnership; IBM Dublin; Trinity College Research Studentship
Author: Muralidharan, Servesh
Advisor: Gregg, David
Qualification name: Doctor of Philosophy (Ph.D.)
Publisher: Trinity College (Dublin, Ireland). School of Computer Science & Statistics
Note: TARA (Trinity's Access to Research Archive) has a robust takedown policy. Please contact us if you have any concerns: rssadmin@tcd.ie
Type of material: thesis
Availability: Full text available