Open Source for Machine Translation

Summary

The success of statistical machine translation systems such as Moses, Language Weaver, and Google Translate has shown that it is possible to build high-performance machine translation systems with relatively little effort using statistical learning techniques.

This course will present the basic modeling behind statistical machine translation in a concise way. Participants will also learn how to use the Moses system, an open source toolkit for machine translation.
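For orientation, the classic noisy-channel formulation at the heart of SMT (covered in Part 1 below) picks the target sentence e that best explains the source sentence f, combining a language model p(e) with a translation model p(f|e). In LaTeX notation:

\hat{e} = \operatorname*{arg\,max}_{e} \, p(e \mid f) = \operatorname*{arg\,max}_{e} \, p(e) \, p(f \mid e)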

Instructor

Alexander Fraser

Email Address: SubstituteMyLastName@cis.uni-muenchen.de

CIS, LMU Munich

DFG Project: Models of Morphosyntax for Statistical Machine Translation

Schedule

October 10th Part 6. Translating to morphologically rich languages: case study on German powerpoint slides
October 10th Part 5. Advanced topics in SMT: discriminative bitext alignment, morphological processing, syntax powerpoint slides
October 9th Part 4. Log-linear Models for SMT and Minimum Error Rate Training (both are sketched in formulas below the schedule) powerpoint slides
October 8th Part 3. Phrase-based Models and Decoding (automatically translating a text given an already learned model) powerpoint slides
October 7th Part 2. Bitext alignment (extracting lexical knowledge from parallel corpora) powerpoint slides
October 7th Part 1. Introduction, basics of statistical machine translation (SMT), evaluation of MT powerpoint slides
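As background for Part 4: modern SMT systems replace the plain noisy-channel model with a log-linear combination of feature functions h_i (language model, phrase translation scores, word penalty, and so on), and minimum error rate training (MERT) tunes the weights lambda_i against an error measure such as 1 - BLEU on a development set. In LaTeX notation, the standard decision rule and tuning objective are:

\hat{e} = \operatorname*{arg\,max}_{e} \sum_{i=1}^{M} \lambda_i \, h_i(e, f)

\hat{\lambda} = \operatorname*{arg\,min}_{\lambda} \sum_{s=1}^{S} E\bigl(\hat{e}(f_s; \lambda),\, r_s\bigr)

where the (f_s, r_s) are the source sentences and reference translations of the development corpus and E is the error count.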


Further literature:

Philipp Koehn's book Statistical Machine Translation

Kevin Knight's tutorial on SMT (particularly look at IBM Model 1; a toy implementation of Model 1 training is sketched after this list)

The Koehn and Knight compound splitting paper (a toy version of their frequency-based splitting method is sketched at the end of the Data section). You can also take a look at Fritzinger and Fraser if you like.
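To complement Knight's tutorial, here is a toy implementation of IBM Model 1 training with EM in Python. It is a sketch for intuition, not what GIZA++ or Moses actually runs: the NULL word, smoothing, and convergence checks are omitted, and the tiny corpus is invented.

from collections import defaultdict

def train_model1(corpus, iterations=10):
    """Toy EM training of IBM Model 1 lexical probabilities t(f|e)."""
    # Uniform initialization over the foreign vocabulary.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in corpus:       # E-step: collect expected counts
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: renormalize
            t[(f, e)] = c / total[e]
    return dict(t)

# Invented three-sentence "bitext"; real training data is in the Data section.
corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = train_model1(corpus)
print(round(t[("das", "the")], 3))  # rises toward 1.0 over the iterations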


Data

Data files you will need to run Moses using experiment.perl:

de-en tiny

new release of small German (with a better trigram language model)

(UPDATED) 50,000 sentences of German/English with trigram language model

BROKEN, OLD GERMAN ORTHOGRAPHY (alte Rechtschreibung): 50,000 sentences of German/English with trigram language model

(UPDATED) 1.4 million sentences of German/English (about 1 GB uncompressed) with trigram language model

minitest de source

minitest en reference

Original config.toy from Moses

Updated config.toy

check.sh

run.sh

mteval-v13a.pl (replace the one in MOSES-1.0 with this one!)

Also install ImageMagick and the Perl module XML::Twig (these are the install commands for Ubuntu):

sudo apt-get install imagemagick
sudo apt-get install libxml-twig-perl

One final note on using experiment.perl: this configuration file skips tuning (minimum error rate training). Tuning is time-consuming because the decoder is run repeatedly. Instead, the configuration file uses precomputed weights, and I have verified that these weights work well for the 50k Europarl dataset.
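To make the role of those weights concrete, here is a minimal sketch of how a decoder (or an n-best rescorer) uses a fixed weight vector: each hypothesis gets the weighted feature sum from the log-linear model sketched above, and the highest-scoring one wins. The feature names and all numbers here are invented for illustration; they are not the weights shipped with the configuration file.

weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -1.0}  # invented values

nbest = [  # hypothesis text and its (invented) feature values
    ("the house is small",  {"lm": -4.1, "tm": -2.3, "word_penalty": 4}),
    ("the house is little", {"lm": -5.0, "tm": -1.9, "word_penalty": 4}),
]

def score(features):
    # Weighted feature sum: the log-linear model with fixed weights.
    return sum(weights[name] * value for name, value in features.items())

best_text, _ = max(nbest, key=lambda hyp: score(hyp[1]))
print(best_text)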

german_text.tok.vcb for compound splitting
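As a pointer for using that file: Koehn and Knight split a compound when the geometric mean of the corpus frequencies of the candidate parts beats the frequency of the unsplit word. The toy sketch below considers only two-part splits (the paper also handles more parts and filler letters like "s" and "es") and assumes a GIZA++-style .vcb line format of "id token count"; check the actual file format before relying on load_vocab.

def load_vocab(path):
    """Assumed GIZA++-style .vcb format: 'id token count' per line."""
    freq = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            _, token, count = line.split()
            freq[token.lower()] = int(count)
    return freq

def split_compound(word, freq, min_part=4):
    """Keep the segmentation whose parts have the highest geometric
    mean frequency; the unsplit word itself is one of the candidates."""
    best_parts, best_score = [word], float(freq.get(word, 0))
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = (freq[left] * freq[right]) ** 0.5  # geometric mean
            if score > best_score:
                best_parts, best_score = [left, right], score
    return best_parts

# Invented counts; real use: freq = load_vocab("german_text.tok.vcb")
freq = {"grund": 500, "rechte": 300, "grundrechte": 2}
print(split_compound("grundrechte", freq))  # ['grund', 'rechte']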