Bayesian Tree Regression within a Streaming Context
Citation:
Ferreira, Michael Antonio, Bayesian Tree Regression within a Streaming Context, Trinity College Dublin, School of Computer Science & Statistics, Statistics, 2023
Abstract:
Regression in a statistical streaming environment explores, in a meaningful way, either large amounts of data or data that is continually being generated. The streaming setting is challenging because either the volume of data to be analysed far exceeds the available resources, or the rate at which the data arrives is at odds with the timeliness required of the inference on that data.
Bayesian methods for streaming regression analysis have focused on particle filtering and sequential Monte Carlo (SMC) methods. Bayesian regression trees have been used as particles because they offer a tractable approach to nonlinear regression: they provide conditional basis functions that can be smoothed over while still allowing sudden changes in the data to be modelled. The Kalman filter, arguably the progenitor of SMC methods, epitomises the Bayesian methodology for analysis by using data to update beliefs, which then become the prior beliefs for new data.
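The way a Kalman filter turns each posterior into the prior for the next observation can be illustrated with a minimal sketch. It assumes a scalar local-level (random walk plus noise) model; the names `Belief` and `kalman_step` and the noise variances `q` and `r` are hypothetical choices for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    """Gaussian belief about the latent level (illustrative name)."""
    mean: float
    var: float


def kalman_step(prior: Belief, y: float, q: float = 0.1, r: float = 1.0) -> Belief:
    """One predict/update cycle of a scalar local-level Kalman filter.

    q is the assumed state (random walk) variance and r the assumed
    observation noise variance; both values are placeholders.
    """
    # Predict: the level follows a random walk, so uncertainty grows by q.
    pred_var = prior.var + q
    # Update: the gain weighs the new observation against the prediction.
    gain = pred_var / (pred_var + r)
    mean = prior.mean + gain * (y - prior.mean)
    var = (1.0 - gain) * pred_var
    return Belief(mean, var)


# Each posterior is passed back in as the prior for the next observation.
belief = Belief(mean=0.0, var=10.0)
for y in [1.2, 0.9, 1.1, 1.0]:
    belief = kalman_step(belief, y)
```

After the loop, `belief` summarises everything learned from the four observations, none of which needs to be kept.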
Markov chain Monte Carlo (MCMC) methods have been largely ignored in the statistical streaming setting because ergodic averaging over Markov chains requires that stationarity of the chains of sample measurements be established. Introducing new data invalidates the claim of stationarity, so new chains of measures must be sampled to re-establish it. What has not been shown is whether MCMC can be used in the streaming setting if one is willing to accept that, at least temporarily, the theoretical requirements for certainty of stationarity be set aside in favour of reaching a target distribution that is, for all intents and purposes, either the same as or very close to the "true" target distribution.
This document sets out to show that, using Bayesian regression trees to provide a collection of conditional filters, MCMC can be used in the streaming setting for nonlinear, nonstationary regression.
A tree filter based on the Kalman filter is developed. This initial stepping stone shows that the explanatory variables are needed only to indicate a refinement of the partition created by the tree; within that refinement, a filter provides an estimate and a prediction for the level of the signal. Thus there is no need to store either explanatory variables or observations because, by the Markov assumption, all histories of the processes are retained in the previous state of the latent process at the refinements and in the tree model.
This fixed tree filter is then developed into an on-the-fly adaptive learning model that searches the space of tree models for possible models as new data is provided. It is shown that, by using Markov chain Monte Carlo, it is possible to get sufficiently close to the target distribution having seen each new data point only once. A single tree represents only a single chain, which limits the search of the model space, so an ensemble of chains of tree measures is provided to allow a more comprehensive search of the distribution of trees. This ensemble provides an approximation to the probability distribution of the trees. A mixture of tree models over this distribution allows tree-model-weighted predictions for the observations and estimates of the state, along with their uncertainty estimates, to be made on-the-fly.
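The fixed tree filter idea can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the tree has a single hypothetical split, each leaf runs a scalar local-level Kalman filter, and the names `Node`, `find_leaf`, `filter_step` and the variances `q` and `r` are invented for the example. The explanatory variables `x` are used only to route to a leaf, and neither `x` nor `y` is stored after the update.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Tree node: internal if var is set, otherwise a leaf holding a filter state."""
    var: Optional[int] = None        # index of the splitting variable (internal nodes)
    threshold: float = 0.0           # go left when x[var] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mean: float = 0.0                # leaf: current estimate of the signal level
    variance: float = 100.0          # leaf: uncertainty of that estimate


def find_leaf(node: Node, x: Sequence[float]) -> Node:
    """Use the explanatory variables only to locate the partition cell."""
    while node.var is not None:
        node = node.left if x[node.var] <= node.threshold else node.right
    return node


def filter_step(tree: Node, x: Sequence[float], y: float,
                q: float = 0.01, r: float = 0.25) -> float:
    """Route (x, y) to a leaf, update that leaf's filter, return the prior prediction."""
    leaf = find_leaf(tree, x)
    prediction = leaf.mean                 # one-step-ahead prediction before seeing y
    pred_var = leaf.variance + q           # predict: random walk inflates uncertainty
    gain = pred_var / (pred_var + r)       # Kalman gain
    leaf.mean += gain * (y - leaf.mean)    # update the level estimate
    leaf.variance = (1.0 - gain) * pred_var
    return prediction


# Hypothetical tree with one split on x[0] at 0.5, and a toy stream.
tree = Node(var=0, threshold=0.5, left=Node(), right=Node())
stream = [((0.1,), 1.0), ((0.9,), 5.0), ((0.2,), 1.1),
          ((0.8,), 4.9), ((0.3,), 0.9), ((0.7,), 5.1)]
for x, y in stream:
    filter_step(tree, x, y)  # each data point is seen once and then discarded
```

Memory use depends only on the number of leaves, not on the length of the stream, which is the property the paragraph above attributes to the Markov assumption.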
Showing that MCMC can be used in the streaming setting opens up a whole gamut of MCMC methods for Bayesian statistical analysis that will broaden the scope of problems that could be tackled over large and streaming data sets. This method can be adapted to existing Bayesian tree regression methods and extended to cover variable selection. Because the trees are independent and the algorithm has constant complexity with respect to the stream of data, the size of the ensemble is limited only by available resources, and the method is amenable to both parallel and concurrent computation. Almost any size of problem can be explored using this method and, because the Kalman filter can handle vectors with ease, the dimension of the response matters only for the matrix manipulation local to each leaf filter. The model provides a method for autoregressive, on-the-fly Gaussian process regression and is also extendable to multi-output Gaussian process regression.
Sponsor:
Science Foundation Ireland (SFI)
Author's Homepage:
https://tcdlocalportal.tcd.ie/pls/EnterApex/f?p=800:71:0::::P71_USERNAME:FERREIMA
Description:
APPROVED
Author: Ferreira, Michael Antonio
Advisor:
Wilson, Simon
Publisher:
Trinity College Dublin. School of Computer Science & Statistics. Discipline of Statistics
Type of material:
Thesis
Availability:
Full text available
Keywords:
Regression, Bayesian, Tree, Streaming, MCMC