
dc.contributor.advisor: Harte, Naomi
dc.contributor.author: Sterpu, George
dc.date.accessioned: 2021-07-05T15:25:52Z
dc.date.available: 2021-07-05T15:25:52Z
dc.date.issued: 2021
dc.date.submitted: 2021
dc.identifier.citation: Sterpu, George, Deep Cross-Modal Alignment in Audio-Visual Speech Recognition, Trinity College Dublin. School of Engineering, 2021
dc.identifier.other: Y
dc.identifier.uri: http://hdl.handle.net/2262/96649
dc.description: APPROVED
dc.description.abstract: Modern studies in cognitive psychology have demonstrated that speech perception is a multimodal process, as opposed to the purely auditory process with visual carryover assumed in the classic view. This led researchers to investigate the nature of the audio-visual speech integration process in the brain. The ability to combine two sources of information, each delivering uncertain predictions, improves the recognition of speech. In this thesis we aim to develop efficient machine learning algorithms and computational models of audio-visual speech recognition (AVSR) that learn to capitalise on the visual modality from examples. My original contribution to knowledge is an efficient strategy for the multimodal alignment and fusion of audio-visual speech on the task of large vocabulary continuous speech recognition. This strategy, termed AV Align, makes limited use of domain knowledge, but exploits the hypothesis that there is an underlying alignment between the higher-order representations of the audio and visual modalities of speech. To achieve a controllable decoding latency, we develop a speech segmentation strategy termed Taris, which aims to segment a spoken utterance by learning to count the number of words from speech data.

Our multimodal systems are presented with audio and video recordings of speech from two large vocabulary audio-visual speech datasets, TCD-TIMIT and LRS2. We corrupt the audio channel with noise taken from a cafeteria environment at three signal-to-noise ratios. For each noise condition, we evaluate the character error rate of the multimodal system and compare it to an equivalent audio-only system trained on the same data to assess the added benefit of the visual modality to speech recognition.

We show empirically that AV Align discovers a monotonic trend in the alignment between the audio and visual modalities. This monotonicity is achieved while AV Align is allowed to search for a soft alignment across full speech utterances, without any supervision or constraints placed on the alignment pattern. On LRS2, the most challenging audio-visual speech dataset used in this work, AV Align obtains improvements over an audio-only system ranging from 6.4% under clean speech conditions up to around 31% at the highest level of audio noise. These improvements were made possible by an exploration of the learning difficulties specific to the audio-visual speech recognition task, which led us to propose a multitask learning approach based on estimating the intensities of two facial action units from video. We also show that the word counting objective of Taris favours the segmentation of speech into units whose length distribution is similar to that of word units estimated with a forced aligner, although the correlation between our segments and the word units remains speculative. Since we design the decoding process of Taris to be robust to segmentation imperfections, we achieve a level of accuracy comparable to that of equivalent systems that make full use of the utterance-level context and are indifferent to latency.

Our findings reflect that we have identified two well-informed modelling assumptions contributing to the domain knowledge of audio-visual speech. The first is the underlying higher-order fusion of cross-modally aligned audio and visual speech representations. The second is the possibility of learning the word count in a spoken utterance from either audio or audio-visual cues, as a mechanism to segment transcribed speech lacking intermediate alignments. Both AV Align and Taris have objectives expressed as fully differentiable functions of their parameters. We believe these will be key ingredients in the adoption of audio-visual speech recognition technology into real products in the years to come.
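The abstract describes AV Align's cross-modal fusion only at a conceptual level. The sketch below, written in PyTorch, is a hypothetical illustration of attention-based fusion of that kind: higher-order audio representations act as queries over visual representations, and the resulting soft alignment map is the quantity in which a monotonic trend could be inspected. The class name, layer sizes, and fusion layer are illustrative assumptions and do not reproduce the thesis implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Hypothetical AV Align-style block (not the thesis code): audio states
    # attend over visual states, and the attended visual context is fused
    # back into the audio stream.
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_states, visual_states):
        # audio_states:  (batch, T_audio, dim) higher-order audio representations
        # visual_states: (batch, T_video, dim) higher-order visual representations
        context, alignment = self.attn(audio_states, visual_states, visual_states)
        fused = torch.tanh(self.fuse(torch.cat([audio_states, context], dim=-1)))
        # `alignment` is the soft audio-to-video attention map; a monotonic
        # trend in it corresponds to the behaviour reported in the abstract.
        return fused, alignment

In such a setup, a character-level decoder would attend over the fused states, and a differentiable word-count objective in the spirit of Taris could be placed on top of the same representations; as the abstract emphasises, both objectives remain fully differentiable functions of the model parameters.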
dc.publisher: Trinity College Dublin. School of Engineering. Discipline of Electronic & Elect. Engineering
dc.rights: Y
dc.subject: Speech Recognition
dc.subject: Audio-Visual Speech Recognition
dc.subject: Multimodal fusion
dc.subject: Online decoding
dc.subject: Deep learning
dc.subject: Taris
dc.subject: AV Align
dc.title: Deep Cross-Modal Alignment in Audio-Visual Speech Recognition
dc.type: Thesis
dc.type.supercollection: thesis_dissertations
dc.type.supercollection: refereed_publications
dc.type.qualificationlevel: Doctoral
dc.identifier.peoplefinderurl: https://tcdlocalportal.tcd.ie/pls/EnterApex/f?p=800:71:0::::P71_USERNAME:STERPUG
dc.identifier.rssinternalid: 231921
dc.rights.ecaccessrights: openAccess
dc.contributor.sponsor: Science Foundation Ireland (SFI)

