Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
File Type: PDF
Item Type: Conference Paper
Date: 2018
Access: openAccess
Citation: Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga (eds.), European Language Resources Association (ELRA), 2018, pp. 1119-1127
Download Item: LREC2018-MV-1072.pdf (PDF) 130.5Kb
Abstract:
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be treated as a supervised task; (2) language scalability requires a streamlined software engineering process across languages.
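The abstract's first claim, that tokenization can be treated as a supervised task, can be illustrated with a minimal sketch: a gold tokenization is converted into one label per character, the kind of training signal a sequence-labeling tool such as Elephant learns from. The labeling scheme and helper function below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of tokenization as supervised character labeling
# (illustrative only; not the Elephant tool's actual label scheme).
# "B" = first character of a token, "I" = inside a token,
# "O" = outside any token (e.g. whitespace).

def char_labels(text, tokens):
    """Derive one label per character of `text` from a gold token list."""
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate token left-to-right
        labels[start] = "B"
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"
        pos = start + len(tok)
    return labels

# A clitic like "n't" shows why segmentation is non-trivial:
text = "Don't stop."
tokens = ["Do", "n't", "stop", "."]
print(list(zip(text, char_labels(text, tokens))))
```

A classifier trained on (character, label) pairs like these can then segment unseen text in any language for which such gold data exists, which is exactly what the UD treebanks provide.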
URI: http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html
http://hdl.handle.net/2262/91610
Sponsor
Grant Number
Science Foundation Ireland (SFI)
13/RC/2106
Author's Homepage: http://people.tcd.ie/vogel
http://people.tcd.ie/moreaue
Author: Vogel, Carl; Moreau, Erwan
Other Titles: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Publisher: European Language Resources Association (ELRA)
Type of material: Conference Paper
Collections:
Availability: Full text available
Keywords: Universal dependencies, Word segmentation, Tokenization, Multilinguality, Interoperability
Subject (TCD): Digital Humanities, Computational linguistics
Licences: