Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

Oct 1, 2018

Patrick Littell

Samuel Larkin

Darlene Stewart

Michel Simard

Cyril Goutte

Chi-Kiu Lo

Abstract

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a "clean" corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task, translating the European Medicines Agency corpus (Tiedemann, 2009), scored among the best systems even in the 10M-word conditions.

Type

Conference paper

Publication

Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Last updated on Oct 1, 2018
