Exercise 2 - OmegaT and IBM Model 1
Alex Fraser
Please write up your results and send them to me (as a PDF) by email
by Friday May 26th at 15:00. Name the PDF file yourlastname_yourfirstname_ex2.pdf
(e.g., fraser_alex_ex2.pdf). Include your name, your Matrikelnummer, and
your email address. Please also attach to the email a zip file
containing the mytest-omegat.tmx file you will create below, and
name it yourlastname_yourfirstname_ex2.zip
First part: Google Translate again
Second part: do a small translation job using OmegaT
Third part: do a basic exercise and discuss some basic questions about Model 1 (and, optionally, some harder questions)
Google Translate
- In this part we will again look at Google Translate for the sentences you corrected in exercise 1.
- Take the 5 sentences for which you got bad output from Google Translate. Translate them again.
- Do you get the same output as before? Or were your corrections partially or fully adopted?
OmegaT
- Download OmegaT from omegat.org
- Create a new project and call it "mytest" (without quotes). See the "instant start" guide to OmegaT, Chapter 2 of the manual; a Google search will turn up a direct link. Set the source language to DE-DE and the target language to EN-US or EN-GB (depending on whether you prefer to write in American or British English). Make a note of where the project was created (the path on disk).
- Go to the main directory of the project, then to its source subdirectory, and create a text file called "text1.txt" containing 5 sentences in German (you could use the ones from the Google Translate exercise if you have them). Make sure to use proper punctuation: OmegaT knows how to segment German sentences, so, for instance, don't separate commas from words.
- Run OmegaT, and load the project. You should see the 5 sentences, which are queued up for translation. Click on the target part of each one, and enter the translation in English.
- Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translations to the target subdirectory
- Save and Exit OmegaT
- The results of your work are stored in the "target" subdirectory, using the same filename. Check the file there to make sure that the output looks OK.
- Go back to the source subdirectory of the project and create another text file "text2.txt". For the first sentence, take the same first German sentence as you used before (i.e., the first sentence in text1.txt). Add 3 new sentences that are similar to sentences two to four in the first file, changing just one word per sentence.
- Run OmegaT, and load the project. You should see the 4 sentences. The first sentence should be an exact match. Accept this. Then click on the second sentence. You should see a "fuzzy match" to the right. Use right click to get to "Replace translation with match". Then edit it. Finish editing these sentences.
- Select "generate translations" (the hotkey is control-D) to get OmegaT to output its database of translations to the target subdirectory
- Save and Exit OmegaT
- IMPORTANT: look at the mytest-omegat.tmx file located in the main project directory and discuss its contents. What is this file for? How should you modify it if you switch language directions (translating English to German)? How much support for segmenting and fuzzy matching is there for German or other languages that interest you (see the OmegaT manual)? Compare this with support for segmentation and fuzzy matching in English.
- The most popular commercial tool like OmegaT is Trados.
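To look at the contents of the TMX file programmatically rather than in a text editor, a minimal Python sketch like the following may help. It assumes only the generic TMX layout (a `tu` element per translation unit, containing one `tuv` per language with a `seg` child). Depending on the TMX version, the language code is stored in a `lang` or an `xml:lang` attribute, so the sketch checks both. The function name `read_tmx` and the embedded sample are hypothetical: the sample is hand-written for illustration and is not actual OmegaT output.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Return one {language: segment text} dict per <tu> element.

    `source` is a file name or a binary file object.
    """
    units = []
    for tu in ET.parse(source).getroot().iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also collects text inside inline tags, if any.
                unit[lang] = "".join(seg.itertext())
        units.append(unit)
    return units


# Hand-written sample for illustration only (not real OmegaT output).
SAMPLE = b"""<tmx version="1.1">
  <header srclang="DE-DE" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv lang="DE-DE"><seg>Das Haus ist klein.</seg></tuv>
      <tuv lang="EN-US"><seg>The house is small.</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = read_tmx(io.BytesIO(SAMPLE))
for unit in units:
    for lang, text in unit.items():
        print(lang, "|", text)
```

To inspect your own project, pass the path of the file instead of the sample, e.g. `read_tmx("mytest-omegat.tmx")` run from the main project directory.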
Model 1
Pseudo-code from Philipp Koehn's book.
Pseudo-code of EM for IBM Model 1:
initialize t(e|f) uniformly
do until convergence
  set count(e|f) to 0 for all e,f
  set total(f) to 0 for all f
  for all sentence pairs (e_s,f_s)
    set total_s(e) = 0 for all e
    for all words e in e_s
      for all words f in f_s
        total_s(e) += t(e|f)
    for all words e in e_s
      for all words f in f_s
        count(e|f) += t(e|f) / total_s(e)
        total(f) += t(e|f) / total_s(e)
  for all f
    for all e
      t(e|f) = count(e|f) / total(f)
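For experimenting, the pseudo-code can be transcribed into Python roughly as follows. This is only a sketch that follows the pseudo-code line by line. The function name `train_model1`, the fixed number of iterations (used here in place of a convergence test), and the three-sentence toy corpus are assumptions made for illustration and are not part of the exercise.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """pairs: list of (e_s, f_s), each a list of word tokens."""
    e_vocab = sorted({e for e_s, _ in pairs for e in e_s})
    f_vocab = sorted({f for _, f_s in pairs for f in f_s})

    # initialize t(e|f) uniformly
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

    # Assumption: a fixed iteration count replaces "do until convergence".
    for _ in range(iterations):
        count = defaultdict(float)  # count(e|f), starts at 0
        total = defaultdict(float)  # total(f), starts at 0
        for e_s, f_s in pairs:
            total_s = defaultdict(float)
            for e in e_s:
                for f in f_s:
                    total_s[e] += t[(e, f)]
            for e in e_s:
                for f in f_s:
                    count[(e, f)] += t[(e, f)] / total_s[e]
                    total[f] += t[(e, f)] / total_s[e]
        for f in f_vocab:
            for e in e_vocab:
                t[(e, f)] = count[(e, f)] / total[f]
    return t


# Hypothetical toy corpus (English side first, Foreign side second).
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

t = train_model1(corpus)
for e in ["the", "house", "book", "a"]:
    print(f"t({e}|das) = {t[(e, 'das')]:.4f}")
```

Printing the table after different numbers of iterations is a simple way to watch the probability mass for each Foreign word concentrate on one English word.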
Basic Exercise
Start by convincing yourself that the incredibly simple estimation you do by running the main loop of the pseudo-code once gives the same results as explicitly enumerating the alignments in slide 42 (the slide where we calculated counts by working on four alignment functions by explicitly enumerating each one). You have to start with the t values on slide 42 to do this, and you apply them to just the pair of two word sentences on slide 42.
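After working through the comparison by hand, you can check it numerically with a sketch like the one below. It computes the expected counts for a single two-word sentence pair in two ways: by explicitly enumerating all alignment functions and weighting each by its normalized probability, and by the per-word normalization used in the main loop of the pseudo-code. The t values here are hypothetical placeholders, not the values from slide 42, so substitute the slide's values to reproduce that example. The function names are also made up for this illustration.

```python
from collections import defaultdict
from itertools import product

e_s = ["the", "house"]
f_s = ["das", "Haus"]

# Hypothetical t(e|f) values; replace them with the ones on the slide.
t = {
    ("the", "das"): 0.7, ("house", "das"): 0.3,
    ("the", "Haus"): 0.4, ("house", "Haus"): 0.6,
}


def counts_by_enumeration(e_s, f_s, t):
    """Sum link counts over every alignment, weighted by P(a | e, f)."""
    # An alignment maps each English position to one Foreign position.
    alignments = list(product(range(len(f_s)), repeat=len(e_s)))
    scores = []
    for a in alignments:
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(e_s[j], f_s[i])]
        scores.append(p)
    z = sum(scores)  # normalizer over all alignments
    counts = defaultdict(float)
    for a, p in zip(alignments, scores):
        for j, i in enumerate(a):
            counts[(e_s[j], f_s[i])] += p / z
    return counts


def counts_by_main_loop(e_s, f_s, t):
    """Fractional counts as collected inside the pseudo-code's main loop."""
    counts = defaultdict(float)
    for e in e_s:
        total_s = sum(t[(e, f)] for f in f_s)
        for f in f_s:
            counts[(e, f)] += t[(e, f)] / total_s
    return counts


enumerated = counts_by_enumeration(e_s, f_s, t)
looped = counts_by_main_loop(e_s, f_s, t)
for key in sorted(enumerated):
    print(key, round(enumerated[key], 6), round(looped[key], 6))
```

The two columns printed for each word pair should agree up to floating-point rounding.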
Basic Questions about Model 1
- What is the alignment structure modeled by IBM Model 1 in the pseudo-code presented above? Is the structure symmetric with respect to English and Foreign?
- How many entries does t(e|f) have after the initialization (line 1 of the pseudo-code)?
- Can you think of a way to initialize that would involve setting some of the parameters in t(e|f) to zero or any other constant without affecting the results? Remember that if N is the number of English types, then t(e|f)=1/N for all e and f. Think about whether any of the entries in t will not be used.
- Under what conditions will an English word e in a particular sentence pair be left unaligned in the Viterbi alignment? What about a Foreign word f?
- Under what circumstances would we prefer that an English word e is unaligned (note that this question is about gold standard word alignment, not modeling)?
Advanced Questions about Model 1 (Optional)
- Suppose you are given Model 1 parameters estimated by someone else. What is a short formula which determines the Viterbi alignment of a fixed sentence pair E and F?
- How could we force cognates (for a language pair like French/English) to be aligned correctly? (Warning, this is a trick question)
- Is there some simple way (either heuristically or by modifying the model; either one is fine) where we could break the independence assumption in Model 1 and allow the alignment of a word at position j to be influenced by the word at position j-1 (of the Foreign side)?
- Look at the "grow" heuristic in the slides. If you know this will be used on a pair of 1-to-N and M-to-1 alignments, is it possible to systematically remove links from one of these alignments (for the sake of discussion assume the M-to-1 alignment) without affecting the final symmetrized alignment?