Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods
Citation:
Meelen, Marieke, Roux, Élie, Hill, Nathan, Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods, ACM Transactions on Asian and Low-Resource Language Information Processing, 20, 1, 2021, 1-11
Abstract:
This paper presents the new and improved version of the Annotated Corpus of Classical Tibetan (ACTib). These segmented and POS-tagged versions of all available texts in the Buddhist Digital Resource Center (BDRC) were annotated automatically using a memory-based tagger (see Meelen and Hill 2017). While this method had certain clear advantages - large amounts of data could quickly be split into meaningful words and grammatical markers, provided with highly detailed morpho-syntactic labels - the accuracy of these initial results can be improved in various ways. In this paper, we present a thorough error analysis and focus on correcting and improving these results using a combination of optimised memory-based, neural-network, and rule-based methods.
Author's Homepage:
http://people.tcd.ie/hillna
Description:
PUBLISHED
Author: Hill, Nathan
Publisher:
Association for Computing Machinery (ACM)
Type of material:
Journal Article
Series/Report no:
ACM Transactions on Asian and Low-Resource Language Information Processing; 20; 1
Availability:
Full text available
DOI:
http://dx.doi.org/10.1145/3409488
ISSN:
2375-4699