Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
File Type: PDF
Item Type: Conference Paper
Date: 2018
Access: openAccess
Citation: Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga (eds.), European Language Resources Association (ELRA), 2018, pp. 1119-1127
Download Item: LREC2018-MV-1072.pdf (PDF) 130.5Kb
Abstract:
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be treated as a supervised task; (2) language scalability requires a streamlined software engineering process across languages.
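The abstract's first claim, that tokenization can be treated as a supervised task, can be illustrated with a minimal sketch: a gold tokenization is converted into one label per character, the kind of training signal a sequence-labeling tool such as Elephant learns from. The labeling scheme and helper function below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of tokenization as supervised character labeling
# (illustrative only; not the Elephant tool's actual label scheme).
# "B" = first character of a token, "I" = inside a token,
# "O" = outside any token (e.g. whitespace).

def char_labels(text, tokens):
    """Derive one label per character of `text` from a gold token list."""
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate token left-to-right
        labels[start] = "B"
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"
        pos = start + len(tok)
    return labels

# A clitic like "n't" shows why segmentation is non-trivial:
text = "Don't stop."
tokens = ["Do", "n't", "stop", "."]
print(list(zip(text, char_labels(text, tokens))))
```

A classifier trained on (character, label) pairs like these can then segment unseen text in any language for which such gold data exists, which is exactly what the UD treebanks provide.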
URI: http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html
http://hdl.handle.net/2262/91610
Sponsor
Grant Number
Science Foundation Ireland (SFI)
13/RC/2106
Author's Homepage: http://people.tcd.ie/vogel
http://people.tcd.ie/moreaue
Author: Vogel, Carl; Moreau, Erwan
Other Titles: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Publisher: European Language Resources Association (ELRA)
Type of material: Conference Paper
Collections:
Availability: Full text available
Keywords: Universal dependencies, Word segmentation, Tokenization, Multilinguality, Interoperability
Subject (TCD): Digital Humanities, Computational linguistics
Licences: