Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

Hill, Nathan

This item is covered by a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence.

File Type:

PDF

Item Type:

Journal Article

Date:

2021

Author:

Hill, Nathan

Access:

openAccess

Citation:

Meelen, Marieke, Roux, Élie, Hill, Nathan, Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods, ACM Transactions on Asian and Low-Resource Language Information Processing, 20, 1, 2021, 1-11

Download Item:

(Accepted for publication (author's copy) - Peer Reviewed) 121.0Kb

Abstract:

This paper presents the new and improved version of the Annotated Corpus of Classical Tibetan (ACTib). These segmented and POS-tagged versions of all available texts in the Buddhist Digital Resource Center (BDRC) were annotated automatically using a memory-based tagger (see Meelen and Hill 2017). This method had clear advantages: large amounts of data could quickly be split into meaningful words and grammatical markers and provided with highly detailed morpho-syntactic labels. The accuracy of these initial results can nevertheless be improved in various ways. In this paper, we present a thorough error analysis and focus on correcting and improving these results using a combination of optimised memory-based, neural-network, and rule-based methods.

URI:

http://hdl.handle.net/2262/102695

Author's Homepage:

http://people.tcd.ie/hillna

Description:

PUBLISHED

Publisher:

Association for Computing Machinery (ACM)

Series/Report no:

ACM Transactions on Asian and Low-Resource Language Information Processing; 20; 1

Availability:

Full text available

DOI:

http://dx.doi.org/10.1145/3409488

ISSN:

2375-4699
