Robust Multimodal Turn-Taking for Human-Machine Interaction

O'Connor Russell, Samuel Arthur

Files

Primary SR_thesis_postviva.pdf (11.24 MB)

Date

2026

Authors

O'Connor Russell, Samuel Arthur

Publisher

Trinity College Dublin. School of Engineering. Discipline of Electronic & Elect. Engineering

Access

openAccess

Citation

O'Connor Russell, Samuel Arthur, Robust Multimodal Turn-Taking for Human-Machine Interaction, Trinity College Dublin, School of Engineering, Electronic & Elect. Engineering, 2026

Abstract

During conversation, humans rapidly switch from speaking to listening and vice versa. This process is called turn-taking. Whilst second nature for humans, turn-taking presents a major challenge for emerging technologies such as voice assistants. At present, the majority of systems rely on detecting the silence after a speaking turn. This fails to replicate the fast-paced nature of human-human interaction. In recent years, predictive turn-taking models (PTTMs) have been proposed. Inspired by human turn-taking, PTTMs are neural networks which continuously predict whether or not a speaker change will occur in the near future. These models form an active area of research, and they have grown in sophistication and performance. This thesis addresses a number of under-explored areas in PTTM development. Currently, a major obstacle is that PTTMs rely upon manual transcription of conversations, a time-consuming and costly process. The thesis therefore offers a comprehensive overview of automatic speech recognition (ASR) for automated conversational speech transcription. Insights include a gap in the performance of commercial and open-source ASR on the transcription of non-lexical aspects of speech. We then explore PTTM training with ASR transcription, finding no difference in the performance of PTTMs trained using word alignments from ASR and from manually transcribed interactions. This enables PTTM training on larger, more diverse multimodal datasets for which manual transcription is typically unavailable. PTTMs rely on speech alone to make their predictions, but human turn-taking is richly multimodal. The inclusion of visual cues such as gaze in PTTMs has only been explored in a single, limited study. This thesis addresses this major gap in the literature by introducing the multimodal voice activity projection (MM-VAP) model, one of the first PTTMs to consider the inclusion of visual cues such as gaze, head pose and facial expression. 
We find that MM-VAP significantly outperforms the state-of-the-art audio-only approach: MM-VAP achieves 83% balanced accuracy on the hold/shift prediction task, compared with 79% for the audio-only state of the art at the same noise level. Our evaluation is conducted across a comprehensive range of events, such as detecting the next speaker during silence and before overlapping speech. A further gap in the literature is that the performance of PTTMs in noise has not been considered, yet models will inevitably encounter noise once deployed. We address this in the thesis by adding environmental noises such as babble and music noise to the Candor videoconferencing corpus. We find that our MM-VAP model is not inherently robust to noise, with performance falling from 83% to 54% balanced accuracy in +10 dB music noise. However, we find that when we adapt the training of the model to include examples of noise, MM-VAP outperforms the state-of-the-art audio-only model by a wide margin, at 75% balanced accuracy in +10 dB speech noise compared with 64% for the audio-only state of the art at the same noise level. We demonstrate that the performance increase arises from the exploitation of visual cues in the MM-VAP model. However, we find that the performance increase does not always generalise to new sources of noise, highlighting the importance of the training process in PTTM development. The thesis also explores the role of multimodal cues within the speech signal. Recent PTTMs leverage self-supervised speech representations (S3Rs). These learned representations capture prosodic, lexical and semantic aspects of the speech signal. However, it is difficult to establish which aspects of speech are exploited for prediction. We propose a vocoder-based methodology to selectively control the amount of prosodic and lexical information in speech. 
We find that the voice activity projection (VAP) model, an S3R-based turn-taking model from the literature, utilises both prosodic and lexical cues for prediction. However, when one feature is corrupted, the model can flexibly utilise the other without further training. A notable finding is that when speech is replaced with unintelligible noise following the prosodic contour of speech, performance remains above chance, at 69% balanced accuracy. This demonstrates that the prosodic contour alone is a powerful predictor of turn-taking patterns. We expand on this analysis by comparing the audio VAP model with large language model (LLM)-based turn-taking models, which rely exclusively on text. We find that the LLM-based models produce a high number of false positive end-of-turn predictions. The audio-only turn-taking model does not suffer from the same errors, suggesting prosody plays a key role in turn-taking prediction. We therefore propose the Audio TurnGPT model, an LLM-based turn-taking model which incorporates acoustic cues. We find that the model significantly improves prediction performance (90% vs. 85% balanced accuracy). The thesis therefore demonstrates the complementary, additive role of multimodal cues in turn-taking prediction. The best-performing models exploit all cues, but there is a degree of redundancy, as individual cues can serve as turn-taking predictors on their own. Multimodality not only improves performance but also increases robustness to noise, and means models retain performance even when cues are removed. The analyses, methodology, and models introduced in this thesis represent significant advances in the predictive turn-taking literature. It is hoped that this thesis leads to an increased awareness of multimodality when working with human interaction data. All modalities that are available to interlocutors in a dataset should be considered, as should their interactions with one another and their influence on unfolding interactions. 
We also offer a discussion on future work and the implications of our findings.
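
The results above are reported as balanced accuracy on a two-class hold/shift task. As a reading aid, the following is a minimal sketch of that metric under its standard definition (the mean of per-class recall). It is not the thesis's evaluation code; the function name and the example labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true))
    if not classes:
        raise ValueError("y_true is empty")
    recalls = []
    for cls in classes:
        # Recall for this class: correct predictions / true instances.
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced labels: holds are more common than shifts.
y_true = ["hold"] * 8 + ["shift"] * 2
# A trivial predictor that never predicts a speaker change.
always_hold = ["hold"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_hold)) / len(y_true)
chance_level = balanced_accuracy(y_true, always_hold)
```

In this toy case the trivial predictor has a plain accuracy of 0.8 but a balanced accuracy of 0.5, which is why 50% is the chance level against which a two-class balanced accuracy figure is usually read.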