Identifying translation effects in English natural language text

Lynch, Gerard

With the rise in popularity of applying machine learning methods to problems in textual stylometry, the increased availability of machine-readable corpora and the emerging benefits of research on corpora of translated text in the field of machine translation, there has been a corresponding increase in interest in the analysis of translated text by computational linguists, a subject which until recent years remained the preserve of translation studies scholars. This thesis details the state of the art in research spanning the fields of computational linguistics, translation studies and the digital humanities, and describes experiments carried out using machine-learning tools on a selection of comparable corpora of translations in English with regard to three main research questions: defining markers of translated vs. original text in the same genre, obtaining source language markers in literary translations, and detecting the stylistic traces of a literary translator. Supervised learning experiments are carried out on a number of comparable corpora of translated text, with a focus on identifying features which capture the range of translation effects mentioned. The features used in this thesis are of two kinds: ngram-based features, consisting of ngrams of words and parts of speech, and document-level features, which consist of the frequencies of a class of textual items and various other metrics including type-token ratios and readability scores.

Chapter 4 describes experiments on two sets of comparable corpora in English, the Europarl corpus and a corpus of translated and original articles from the online version of the New York Times (NYT), with the goal of mining features of translated language, or translationese. Support Vector Machines are used along with Naive Bayes and Simple Logistic classifiers on these corpora, with the task of distinguishing the translated side of the corpora from the non-translated side. Classification accuracy was circa 80% for the Europarl corpus and slightly less for the NYT corpus, using a mixed feature set of the features mentioned above. The different genres of the corpora resulted in generally non-intersecting distinguishing feature sets for each corpus; however, there were a small number of features in common. Classifiers which were trained on Europarl and tested on the NYT corpus reported poor results, which corroborated results from the literature by Koppel and Ordan (2011) on different dialects of translationese.

Chapter 5 tackles the question of source language detection in translations, as examined in the Europarl corpus by van Halteren (2008). The corpus focused on here is a corpus of literary text from the nineteenth century, comprising texts translated from German, French and Russian, with English original texts also included in the experiments. Using comparable experimental methodology to the previous chapter, classifiers were trained on the corpus with the task of classifying the source language of a text, a three- or four-class classification problem. Accuracy results varied from 99% using a feature set of the 500 most distinguishing word unigrams to 85% for a feature set containing document metrics, POS bigrams and common words. This classifier was also tested on a separate but comparable set of texts from the same literary period in order to examine classifier performance on unknown data. A drop of ca. 20% in classification accuracy was observed in both the three-language and the four-language experiments, although results were still significantly higher than the baseline in both cases.

Chapter 6 focuses on the question of mining distinguishing features of translator style using the same approach as previous chapters, both in parallel translations of the same text and in a corpus of translations of different texts from the same playwright by each of the translators examined. This represented a novel approach towards detecting stylistic characteristics of a translator's writing. The playwright in question was the nineteenth-century Norwegian author Henrik Ibsen and the two translators were William Archer and R. Farquharson Sharp. High accuracy (90%) was obtained using feature sets containing only one feature type in ten-fold cross-validation experiments on the parallel translations. A classifier using document-level feature sets only was trained on the larger corpus of non-parallel translations and tested on the parallel translation set. An accuracy of 80% was obtained for the task of determining the translator of each of the two parallel translations of the same play, indicating that each translator maintained a distinguishable textual style across all of his translations of the playwright in question. Features included the frequency of contracted forms in English, the use of different verb forms in the translation of stage directions, and metrics such as average sentence length and type-token ratio. Sharp used the word "because" and a number of other common words significantly more than Archer in his parallel translation. These words were investigated with reference to the original source, coupled with a diachronic English corpus, to determine whether this phenomenon was a marker of the style of a particular translator or had other origins.

Chapter 7 focuses on commonalities over the three experiments, including document-level and ngram features which are found to be distinguishing in more than one experiment, such as the Coleman-Liau index and the ratio of nouns to total words, along with suggestions for future experimentation.
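As an illustration of the document-level features named in the abstract (type-token ratio, average sentence length, frequency of contracted forms and the Coleman-Liau index), the following is a minimal Python sketch. It is not the thesis's implementation: the function name `document_features`, the regular-expression tokenisation and the naive sentence splitting are all assumptions made for this example, and the thesis may have used different tools and definitions.

```python
import re


def document_features(text: str) -> dict:
    """Compute a few document-level stylometric features for one text.

    Hypothetical helper for illustration only; tokenisation is deliberately naive.
    """
    # Naive sentence split on runs of terminal punctuation (assumption:
    # abbreviations such as "R. Farquharson" will be split incorrectly).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A word is a run of letters, optionally with one internal apostrophe ("don't").
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    n_letters = sum(ch.isalpha() for w in words for ch in w)
    types = {w.lower() for w in words}
    # Apostrophe-bearing tokens approximate contracted forms
    # (assumption: possessives such as "Ibsen's" are counted too).
    contractions = sum(1 for w in words if "'" in w or "’" in w)

    # Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8, where
    # L = letters per 100 words and S = sentences per 100 words.
    letters_per_100 = n_letters / n_words * 100
    sentences_per_100 = len(sentences) / n_words * 100

    return {
        "type_token_ratio": len(types) / n_words,
        "avg_sentence_length": n_words / len(sentences),
        "contraction_rate": contractions / n_words,
        "coleman_liau": 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8,
    }


if __name__ == "__main__":
    sample = "I don't know. The cat sat on the mat."
    for name, value in document_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind, computed per text or per text chunk, could then be passed to any supervised classifier. A part-of-speech tagger would be needed for features such as the ratio of nouns to total words, which this sketch omits.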
