Modelling Large-scale Datasets Using Principal Component Analysis
Citation: ALAKKARI, SALAHEDDIN, Modelling Large-scale Datasets Using Principal Component Analysis, Trinity College Dublin. School of Computer Science & Statistics, 2020
Principal Component Analysis (PCA) is one of the most well-known unsupervised learning techniques for dimensionality reduction and feature extraction. The main task of PCA is to compute a low-dimensional space that captures the maximal variability in the input dataset. Such a holistic linear representation is optimal in terms of mean-squared error. The basis vectors that span this space correspond to the $k$ most significant eigenvectors of the sample covariance matrix. However, computing these eigenvectors is expensive, with quadratic computational dependence on the data size. The ever-increasing size of datasets necessitates investigating reduced-complexity methods for finding them. A common treatment is to apply streaming PCA methods, which approximate the eigenvectors in a single data pass at linear computational cost. However, state-of-the-art streaming approaches are highly sequential and assume that samples are independent and identically distributed. In the first part of this thesis, we investigate the convergence of such methods when extended to the mini-batch mode, which outperforms the traditional fully online mode in terms of run-time. Furthermore, we propose an acceleration scheme for mini-batch streaming methods based on the Stochastic Gradient Approximation (SGA); such methods have the lowest computational cost among streaming algorithms. Based on empirical evaluation using the spiked covariance model and benchmark datasets, we show that applying our scheme significantly improves the convergence of the original techniques and outperforms other state-of-the-art methods. In the second part, we investigate the performance of PCA applied in a partitioning manner, in which the attributes are divided into a number of subsets and the standard approach is performed on each subset separately.
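The mini-batch streaming idea described above can be sketched in a few lines. This is a minimal illustration of an Oja-style stochastic-gradient (SGA) update with QR re-orthonormalization, not the thesis's exact accelerated scheme; the function name, step size, and batch schedule are illustrative assumptions:

```python
import numpy as np

def minibatch_sga_pca(X, k, batch_size=64, lr=1e-2, epochs=1):
    """Estimate the top-k eigenvectors of the sample covariance with
    mini-batch stochastic gradient (Oja-style) updates.  A sketch only:
    the accelerated scheme in the thesis is not reproduced here."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            B = X[start:start + batch_size]        # one mini-batch of samples
            grad = B.T @ (B @ W) / len(B)          # batch covariance applied to W
            W, _ = np.linalg.qr(W + lr * grad)     # gradient step, re-orthonormalize
    return W
```

Each mini-batch costs O(d·k·|B|), linear in the data size for one pass, which is the complexity advantage the abstract refers to; the recovered subspace can be checked against a full eigendecomposition (`np.linalg.eigh`) on small data.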
We study two strategies for mapping attributes to different sets: Cell-based PCA (CPCA), where samples are spatially divided into smaller blocks, and Band-based PCA (BPCA), where attributes are partitioned based on their value distributions. We show that such models have several advantages over the holistic approach, including improved reconstruction quality and greater scalability. We also find that the baseline model, obtained by randomly mapping attributes, is analogous to holistic PCA, which makes it a more practical and parallelizable alternative to streaming PCA paradigms. Not only are these methods beneficial for data compression, but they also provide lightweight representations that can improve the accuracy and training time of deep learning models. Time-varying datasets of various physical observations are also addressed. We theoretically draw an analogy between many analytic physical models and the PCA eigenvalue problem, showing that, for a wide range of physical phenomena, the eigenvectors derived using PCA are analogous to the analytic physical model. Since time-varying datasets are no exception to the curse of dimensionality, we further evaluate the performance of streaming PCA methods on several time-varying physical observations.
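The partitioned approach can be sketched as follows: attributes are split into blocks and standard PCA is run on each block independently, so the blocks can be processed in parallel. The contiguous block boundaries and function names here are illustrative assumptions, not the thesis's CPCA/BPCA mapping rules:

```python
import numpy as np

def partitioned_pca(X, n_blocks, k):
    """Split the attributes into contiguous blocks and run standard
    PCA on each block separately (a sketch of the partitioning idea;
    CPCA/BPCA use spatial or value-based mappings instead)."""
    n, d = X.shape
    bases = []
    for cols in np.array_split(np.arange(d), n_blocks):
        Xb = X[:, cols] - X[:, cols].mean(axis=0)   # centre the block
        C = Xb.T @ Xb / (n - 1)                     # block sample covariance
        _, vecs = np.linalg.eigh(C)                 # eigenvalues ascending
        bases.append((cols, vecs[:, -k:]))          # keep top-k eigenvectors
    return bases

def reconstruct(X, bases):
    """Project each block onto its own basis and reassemble the samples."""
    Xr = np.zeros_like(X, dtype=float)
    for cols, W in bases:
        Xb = X[:, cols]
        mu = Xb.mean(axis=0)
        Xr[:, cols] = (Xb - mu) @ W @ W.T + mu      # per-block rank-k reconstruction
    return Xr
```

Each block's eigenproblem involves only d/n_blocks attributes, so the quadratic cost in the attribute dimension shrinks by roughly a factor of n_blocks squared per block, which is the scalability gain the abstract claims.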
Author: ALAKKARI, SALAHEDDIN
Publisher: Trinity College Dublin. School of Computer Science & Statistics. Discipline of Computer Science
Type of material: Thesis
Availability: Full text available