Neural Turn-Taking Models for Spoken Dialogue Systems
Citation:
Roddy, Matthew, Neural Turn-Taking Models for Spoken Dialogue Systems, Trinity College Dublin. School of Engineering, 2021

Abstract:
To simulate naturalistic turn-taking behaviours, such as fast turn switches, intentional overlap, backchanneling, and barge-in, spoken dialogue systems (SDSs) will need computational models of turn-taking that are both predictive and incremental. They will need to be predictive in the sense that they predict future user turn-taking behaviours rather than respond to behaviours that have already occurred, as is typically done in traditional endpointing-based systems. The projection theory of Sacks et al. (1974) proposes that humans are capable of anticipating turn endings before they occur. We argue that SDSs which aim to converse in a human-like manner should be capable of anticipating user behaviours as well. To make decisions based on these predictions, the system must process information incrementally, while the user is still speaking.
In this thesis we develop recurrent neural network (RNN) based models of turn-taking that are both predictive and incremental. Continuous turn-taking (CTT) models as proposed by Skantze (2017) were taken as a starting point. We investigated these models and proposed a number of improvements and extensions. First, we performed an analysis of input features for CTT models, gained insights into the utility of different varieties of features, and proposed optimal sets. We then proposed architectural improvements to the original CTT model in the form of a multiscale RNN architecture that allows features to be processed at an independent rate. We then designed a control process based on partially observable Markov decision processes (POMDPs) that is able to employ the predictive nature of our RNN models to make responsive turn-taking decisions.
Our investigations led to the development of a different variety of model that can be used for generating naturalistic response timings using features from both the user's turn and the system's turn. Our response timing networks (RTNets) are motivated by the observation that response timings carry communicative importance, and that listeners associate different timings with different types of responses. RTNets are still both predictive and incremental, but they differ from CTT models in many other aspects, such as their objective functions and architectures. We propose that these models address an overlooked aspect of SDS response generation that can increase the realism of SDS interactions.
Sponsor:
Adapt Centre
Description:
APPROVED
Author: Roddy, Matthew
Advisor:
Harte, Naomi
Publisher:
Trinity College Dublin. School of Engineering. Discipline of Electronic & Elect. Engineering
Type of material:
Thesis
Availability:
Full text available
Keywords:
Machine Learning, Dialogue Systems