Show simple item record

dc.contributor.advisor: Harte, Naomi
dc.contributor.author: Pandey, Ayushi
dc.date.accessioned: 2024-04-25T08:20:19Z
dc.date.available: 2024-04-25T08:20:19Z
dc.date.issued: 2024
dc.date.submitted: 2024
dc.identifier.citation: Pandey, Ayushi, Segmental evaluation of Text-to-Speech synthesis, Trinity College Dublin, School of Engineering, Electronic & Elect. Engineering, 2024
dc.identifier.other: Y
dc.identifier.uri: http://hdl.handle.net/2262/108296
dc.description: APPROVED
dc.description.abstract: Advancements in speech synthesis technology have created a need for reliable evaluation methods. Present-day evaluation, dominated by subjective listening tests, provides at best a general overall picture of perceived speech quality. It does not provide information about acoustic parameters and their contribution to perceived attributes of synthetic speech such as naturalness, similarity and pleasantness. Naturalness in particular, a widely used standard in synthetic speech evaluation, is often underspecified. It has also been reported that factors such as modified instructions, contextual framing, or user expectations about the application of synthetic speech can influence ratings of naturalness. However, multiple studies of synthetic speech evaluation show consistent listener agreement on ratings of naturalness. This leads us to hypothesize that there may be information in the acoustic signature of TTS signals that listeners exploit to make a judgment on naturalness. The primary goal of this thesis is to use contrastive properties of speech segments present in corpora of synthetic speech to evaluate the naturalness of synthetic speech. Naturalness has been discussed as a multi-faceted perceptual attribute; the scope of this thesis is limited to one aspect, the human-likeness of TTS voices.

We selected the Blizzard 2013 corpus for our analysis because it provides parallel TTS data over a wide selection of HMM, unit-selection, hybrid and, more recently, neural TTS techniques. Contrastive features of vowels and obstruent consonants are extracted using standard acoustic-phonetic and corpus-phonetics techniques. Features of each synthetic voice are compared with the human voice, which is held as the reference. A new subjective evaluation framework is then proposed which complements the diagnostic nature of the segmental analysis. Our results show that segmental evaluation can provide diagnostic information that is often missed by traditional subjective tests. In non-neural systems, we find that features of obstruent consonants such as spectral tilt and RMS amplitude can be useful for identifying quality differences between systems and groups of systems. Additionally, vowel features such as within-category dispersion showed an above-chance correlation (rho = 0.64) with the perceived MOS. Next, we show that segmental evaluation can be extended successfully to evaluating modern, neural TTS synthesizers. We find that neural TTS performs very well in modelling vowels and has improved over several features of the older, non-neural TTS synthesizers; only a few features, such as F0 onset and spectral tilt, show statistically significant deviations from the human voice. However, features of voiceless obstruents were found to be distorted, i.e., they deviated significantly from the reference human voice. This is one of the major findings of this thesis.

We also investigate the perceptual significance of the deviation in obstruents through a novel subjective evaluation design. The study involved presenting stimuli of varying lengths to 128 participants, who were asked to identify whether each stimulus was produced by a human or a machine. We hypothesized that longer stimuli would aid more accurate discrimination between human and machine stimuli. The participants' responses were captured using a two-alternative forced-choice task and analyzed using logistic regression. In obstruent-rich stimuli, we indeed found a 22.37% increase in accuracy as stimulus length increased, with strongly significant effects (p < 0.001). The findings in this thesis can be used to provide localized insights into feature distortion and can be extended to provide real-time feedback for TTS engineers. They also highlight the usefulness of phonetics in TTS technology and enable greater interaction between the two communities.
dc.publisher: Trinity College Dublin. School of Engineering. Discipline of Electronic & Elect. Engineering
dc.rights: Y
dc.subject: WaveNet
dc.subject: WaveGAN
dc.subject: Voiceless regions
dc.subject: Distortion
dc.subject: Segmental evaluation of TTS
dc.subject: Text-to-Speech evaluation
dc.subject: Neural TTS
dc.subject: Linguistics in TTS
dc.subject: Phonetics in TTS
dc.subject: Psychophysics
dc.subject: Perceptual evaluation of TTS
dc.subject: Weber-Fechner Law
dc.subject: Obstruents
dc.subject: Obstruent distortion
dc.subject: Voiceless distortion in TTS
dc.title: Segmental evaluation of Text-to-Speech synthesis
dc.type: Thesis
dc.type.supercollection: thesis_dissertations
dc.type.supercollection: refereed_publications
dc.type.qualificationlevel: Doctoral
dc.identifier.peoplefinderurl: https://tcdlocalportal.tcd.ie/pls/EnterApex/f?p=800:71:0::::P71_USERNAME:PANDEYA
dc.identifier.rssinternalid: 265330
dc.rights.ecaccessrights: openAccess
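
Editor's note on the segmental features named in the abstract (spectral tilt, RMS amplitude of obstruents): the following is a minimal illustrative sketch, not the thesis code, of how such features could be computed for one labelled segment. It assumes forced-alignment timestamps are available; the function name, inputs, and the particular spectral-tilt definition (slope of a line fit to the log-magnitude spectrum) are assumptions for illustration.

# Minimal sketch: RMS amplitude and spectral tilt for one labelled segment.
# Hypothetical inputs: a mono WAV path plus start/end times from forced alignment.
import numpy as np
from scipy.io import wavfile
from scipy.stats import linregress

def segment_features(wav_path, start_s, end_s):
    sr, audio = wavfile.read(wav_path)            # assumes a mono WAV file
    audio = audio.astype(np.float64)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak                      # normalise to [-1, 1]
    seg = audio[int(start_s * sr):int(end_s * sr)]

    # RMS amplitude in dB
    rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)

    # Spectral tilt: slope of a regression line through the log-magnitude
    # spectrum over log2 frequency (one common operational definition).
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / sr)
    keep = (freqs > 50) & (freqs < 8000)          # skip DC and very high bins
    tilt = linregress(np.log2(freqs[keep]),
                      20 * np.log10(spec[keep] + 1e-12)).slope
    return rms_db, tilt

In practice, per-segment features from each synthetic voice would then be compared against the same segments in the human reference, for example with per-feature statistical tests or a correlation against MOS, as the abstract describes.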
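
The perceptual study described in the abstract analyses two-alternative forced-choice responses with a logistic regression relating detection accuracy to stimulus length. Below is a minimal sketch of that kind of analysis under stated assumptions: the file name and the columns 'correct', 'length_s' and 'obstruent_rich' are hypothetical, not taken from the thesis.

# Minimal sketch: logistic regression on hypothetical per-trial 2AFC data,
# where 'correct' is 1 if the listener identified human vs. machine correctly.
import pandas as pd
import statsmodels.formula.api as smf

responses = pd.read_csv("afc_responses.csv")       # hypothetical response table
model = smf.logit("correct ~ length_s * obstruent_rich", data=responses).fit()
print(model.summary())                             # log-odds coefficients and p-values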


Files in this item


This item appears in the following Collection(s)
